KenLM: Faster and Smaller Language Model Queries

Size: px
Start display at page:

Download "KenLM: Faster and Smaller Language Model Queries"

Transcription

1 KenLM: Faster and Smaller Language Model Queries Kenneth Carnegie Mellon July 30, 2011 kheafield.com/code/kenlm

2 What KenLM Does Answer language model queries using less time and memory. log p(<s> iran) = log p(<s> iran is ) = log p(<s> iran is one) = log p(<s> iran is one of ) = log p( iran is one of the ) = log p( is one of the few ) =

3 Related Work Downloadable Baselines SRI Popular and considered fast but high-memory IRST Open source, low-memory, single-threaded Rand Low-memory lossy compression MIT Mostly estimates models but also does queries Papers Without Code TPT Better memory locality Sheffield Lossy compression techniques

4 Related Work Downloadable Baselines SRI Popular and considered fast but high-memory IRST Open source, low-memory, single-threaded Rand Low-memory lossy compression MIT Mostly estimates models but also does queries Papers Without Code TPT Better memory locality Sheffield Lossy compression techniques After KenLM s Public Release Berkeley Java; slower and larger than KenLM

5 Why I Wrote KenLM Decoding takes too long Answer queries quickly Load quickly with memory mapping Thread-safe

6 Why I Wrote KenLM Decoding takes too long Answer queries quickly Load quickly with memory mapping Thread-safe Bigger models Conserve memory

7 Why I Wrote KenLM Decoding takes too long Answer queries quickly Load quickly with memory mapping Thread-safe Bigger models Conserve memory SRI doesn t compile Distribute and compile with decoders

8 Outline State 1 State 2 Probing 3 Perplexity Translation

9 Example Language Model State Unigrams Words log p Back <s> iran is one of Bigrams Words log p Back <s> iran iran is is one one of Trigrams Words log p <s> iran is -1.1 iran is one -2.0 is one of -0.3

10 Example Queries State Unigrams Words log p Back <s> iran is one of Bigrams Words log p Back <s> iran iran is is one one of Trigrams Words log p <s> iran is -1.1 iran is one -2.0 is one of -0.3 Query: <s> iran is log p(<s> iran is) = -1.1 Query: iran is of log p(of) -2.5 Backoff(is) -1.4 Backoff(iran is) log p(iran is of) = -4.3

11 State Lookups Performed by Queries <s> iran is Lookup 1 is 2 iran is 3 <s> iran is Score log p(<s> iran is) = -1.1 Lookup 1 of 2 is of (not found) 3 is 4 iran is Score iran is of log p(of) -2.5 Backoff(is) -1.4 Backoff(iran is) log p(iran is of) = -4.3

12 State Lookups Performed by Queries <s> iran is Lookup 1 is 2 iran is 3 <s> iran is Score log p(<s> iran is) = -1.1 Lookup 1 of 2 is of (not found) 3 is 4 iran is Score iran is of log p(of) -2.5 Backoff(is) -1.4 Backoff(iran is) log p(iran is of) = -4.3

13 State Lookups Performed by Queries <s> iran is Lookup 1 is 2 iran is 3 <s> iran is Score log p(<s> iran is) = -1.1 State Backoff(is) Backoff(iran is) Lookup 1 of 2 is of (not found) 3 is 4 iran is Score iran is of log p(of) -2.5 Backoff(is) -1.4 Backoff(iran is) log p(iran is of) = -4.3

14 Stateful Query Pattern State log p(<s> iran) = -3.3 log p(<s> iran is ) = -1.1 log p( iran is one) = -2.0 log p( is one of ) = -0.3

15 Stateful Query Pattern State Backoff(<s>) log p(<s> iran) = -3.3 Backoff(iran), Backoff(<s> iran) log p(<s> iran is ) = -1.1 Backoff(is), Backoff(iran is) log p( iran is one) = -2.0 Backoff(one), Backoff(is one) log p( is one of ) = -0.3 Backoff(of), Backoff(one of)

16 Probing Probing Fast. Uses hash tables. Small. Uses sorted arrays. Smaller. with compressed pointers. Key Subproblem Sparse lookup: efficiently retrieve values for sparse keys

17 Probing Sparse Lookup Speed 100 Lookups/µs 10 1 probing hash set unordered interpolation binary search set Entries

18 Probing Sparse Lookup Speed 100 Lookups/µs 10 1 probing hash set unordered interpolation binary search set Entries

19 Probing Linear Probing Hash Table Store 64-bit hashes and ignore collisions. Bigrams Words Hash log p Back <s> iran 0xf0ae9c2442c6920e iran is 0x959e48455f4a2e is one 0x186a7caef34acf one of 0xac db8dac

20 Probing Linear Probing Hash Table 1.5 buckets/entry (so buckets = 6). Ideal bucket = hash mod buckets. Resolve bucket collisions using the next free bucket. Bigrams Words Ideal Hash log p Back iran is 0 0x959e48455f4a2e x0 0 0 is one 2 0x186a7caef34acf one of 2 0xac db8dac <s> iran 4 0xf0ae9c2442c6920e x0 0 0 Array

21 Probing Probing Data Structure Unigrams Words log p Back <s> iran is one of Array Bigrams Words log p Back <s> iran iran is is one one of Probing Hash Table Trigrams Words log p <s> iran is -1.1 iran is one -2.0 is one of -0.3 Probing Hash Table

22 Probing Probing Hash Table Summary Hash tables are fast. But memory is 24 bytes/entry. Next: Saving memory with.

23 Probing Uses Sorted Arrays Sort in suffix order. Unigrams Words log p Back Ptr <s> iran is one of Bigrams Words log p Back Ptr <s> iran iran is one is <s> one is one one of Trigrams Words log p <s> iran is -1.1 <s> one is -2.3 iran is one -2.0 <s> one of -0.5 is one of -0.3

24 Probing Sort in suffix order. Encode suffix using pointers. Unigrams Words log p Back Ptr <s> iran is one of Array Bigrams Words log p Back Ptr <s> iran <s> is iran is one is <s> one is one one of Array Trigrams Words log p <s> iran is -1.1 <s> one is -2.3 iran is one -2.0 <s> one of -0.5 is one of -0.3 Array

25 Probing Interpolation Search In Interpolation Search O(log log n) Each trie node is a sorted array. Bigrams: * is Words log p Back Ptr <s> is iran is one is key A.first pivot = A A.last A.first Binary Search: O(logn) pivot = A 2

26 Probing Saving Memory with Bit-Level Packing Store word index and pointer using the minimum number of bits. Optional Quantization Cluster floats into 2 q bins, store q bits/float (same as IRSTLM).

27 Probing : Compress Pointers Bigrams Words log p Back Ptr <s> iran iran is one is <s> one is one one of Increasing

28 Probing : Compress Pointers Offset Ptr Binary Raj and Whittaker (2003)

29 Probing : Compress Pointers Offset Ptr Binary ped Offset 1 6 Raj and Whittaker (2003)

30 Probing : Compress Pointers Offset Ptr Binary ped Offset Raj and Whittaker (2003)

31 Probing / Summary Save memory: bit packing, quantization, and pointer compression.

32 Outline Perplexity Translation 1 State 2 Probing 3 Perplexity Translation

33 Perplexity Task Perplexity Translation Score the English Gigaword corpus. Model SRILM 5-gram from Europarl + De-duplicated News Crawl Measurements Queries/ms Loaded Memory Peak Memory Excludes loading and file reading time Resident after loading Peak virtual after scoring

34 Perplexity Task: Exact Models 2000 Ken Probing Perplexity Translation 1500 Queries/ms 1000 Ken SRI 500 Ken IRST Inverted IRST SRI Compact MIT 0 Loaded Peak Memory (GB)

35 Perplexity Translation Perplexity Task: Berkeley Always Quantizes to 19 bits 2000 Ken Probing 1500 Queries/ms Ken Ken IRST Inverted 19 IRST Berkeley Compress 19 Berkeley Scroll Berkeley Hash SRI Compact MIT Memory (GB) SRI

36 Perplexity Translation Perplexity Task: RandLM from an ARPA file 2000 Ken Probing 1500 Queries/ms Ken 8 Ken 8 Berkeley Scroll Berkeley Hash IRST Inverted 19 IRST SRI Compact Berkeley Compress 19 8 Rand Backoff p(false) = 2 8 MIT Memory (GB) SRI

37 Translation Task Perplexity Translation Translate 3003 sentences using Moses. System WMT 2011 French-English baseline, Europarl+News LM Measurements Time Total wall time, including loading Memory Total resident memory after decoding

38 Moses Benchmarks: 8 Threads 3 4 Perplexity Translation 8 Rand Backoff 2 8 false Time (h) Probing SRI Memory (GB)

39 Perplexity Translation Moses Benchmarks: Single Threaded 4 8 Rand Backoff 2 8 false 3 Time (h) IRST SRI 8 8 Probing Memory (GB)

40 Perplexity Translation Comparison to RandLM (Unpruned Model, One Thread) 6 5 Rand Backoff p(false) = 2 8 p(false) = 2 10 Time (h) Rand Stupid Backoff p(false) = 2 8 p(false) = Memory (GB)

41 Conclusion Perplexity Translation Maximize speed and accuracy subject to memory. Probing > > > RandLM Stupid for both speed and memory. Distributed with decoders: Moses cdec Joshua file KLanguageModel use kenlm=true kheafield.com/code/kenlm/

N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh

N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh N-Gram Language Modelling including Feed-Forward NNs Kenneth Heafield University of Edinburgh History of Language Model History Kenneth Heafield University of Edinburgh 3 p(type Predictive) > p(tyler Predictive)

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The

More information

Efficient Data Structures for Massive N-Gram Datasets

Efficient Data Structures for Massive N-Gram Datasets Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.it Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy

More information

Faster and Smaller N-Gram Language Models

Faster and Smaller N-Gram Language Models Faster and Smaller N-Gram Language Models Adam Pauls Dan Klein Computer Science Division University of California, Berkeley {adpauls,klein}@csberkeleyedu Abstract N-gram language models are a major resource

More information

CS 288: Statistical NLP Assignment 1: Language Modeling

CS 288: Statistical NLP Assignment 1: Language Modeling CS 288: Statistical NLP Assignment 1: Language Modeling Due September 12, 2014 Collaboration Policy You are allowed to discuss the assignment with other students and collaborate on developing algorithms

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval David Guthrie Computer Science Department University of Sheffield D.Guthrie@dcs.shef.ac.uk Mark Hepple Computer Science

More information

Efficient Data Structures for Massive N -Gram Datasets

Efficient Data Structures for Massive N -Gram Datasets Efficient Data Structures for Massive N -Gram Datasets Giulio Ermanno Pibiri Dept. of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it ABSTRACT The efficient indexing of large and

More information

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1 Hash Tables Outline Definition Hash functions Open hashing Closed hashing collision resolution techniques Efficiency EECS 268 Programming II 1 Overview Implementation style for the Table ADT that is good

More information

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall Advanced Algorithmics (6EAP) MTAT.03.238 Hashing Jaak Vilo 2016 Fall Jaak Vilo 1 ADT asscociative array INSERT, SEARCH, DELETE An associative array (also associative container, map, mapping, dictionary,

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Outline GIZA++ Moses. Demo. Steps Output files. Training pipeline Decoder

Outline GIZA++ Moses. Demo. Steps Output files. Training pipeline Decoder GIZA++ and Moses Outline GIZA++ Steps Output files Moses Training pipeline Decoder Demo GIZA++ A statistical machine translation toolkit used to train IBM Models 1-5 (moses only uses output of IBM Model-1)

More information

Chapter 5: Physical Database Design. Designing Physical Files

Chapter 5: Physical Database Design. Designing Physical Files Chapter 5: Physical Database Design Designing Physical Files Technique for physically arranging records of a file on secondary storage File Organizations Sequential (Fig. 5-7a): the most efficient with

More information

Hashing IV and Course Overview

Hashing IV and Course Overview Date: April 5-6, 2001 CSI 2131 Page: 1 Hashing IV and Course Overview Other Collision Resolution Techniques 1) Double Hashing The first hash function determines the home address If the home address is

More information

Chapter IR:IV. IV. Indexes. Inverted Indexes Query Processing Index Construction Index Compression

Chapter IR:IV. IV. Indexes. Inverted Indexes Query Processing Index Construction Index Compression Chapter IR:IV IV. Indexes Inverted Indexes Query Processing Index Construction Index Compression IR:IV-267 Indexes HAGEN/POTTHAST/STEIN 2018 Inverted Indexes Index [ANSI/NISO 1997] An index is a systematic

More information

AAL 217: DATA STRUCTURES

AAL 217: DATA STRUCTURES Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average

More information

System Structure Revisited

System Structure Revisited System Structure Revisited Naïve users Casual users Application programmers Database administrator Forms DBMS Application Front ends DML Interface CLI DDL SQL Commands Query Evaluation Engine Transaction

More information

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions

A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions Marcin Junczys-Dowmunt Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614

More information

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18

PROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18 PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations

More information

SILT: A Memory-Efficient, High- Performance Key-Value Store

SILT: A Memory-Efficient, High- Performance Key-Value Store SILT: A Memory-Efficient, High- Performance Key-Value Store SOSP 11 Presented by Fan Ni March, 2016 SILT is Small Index Large Tables which is a memory efficient high performance key value store system

More information

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project after today s lecture Get familiar with bit-twiddling

More information

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES

PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential

More information

Research Students Lecture Series 2015

Research Students Lecture Series 2015 Research Students Lecture Series 215 Analyse your big data with this one weird probabilistic approach! Or: applied probabilistic algorithms in 5 easy pieces Advait Sarkar advait.sarkar@cl.cam.ac.uk Research

More information

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing. The Lecture Contains: Hashing Dynamic hashing Extendible hashing Insertion file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture9/9_1.htm[6/14/2012

More information

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved

Introducing Hashing. Chapter 21. Copyright 2012 by Pearson Education, Inc. All rights reserved Introducing Hashing Chapter 21 Contents What Is Hashing? Hash Functions Computing Hash Codes Compressing a Hash Code into an Index for the Hash Table A demo of hashing (after) ARRAY insert hash index =

More information

N-gram language models for massively parallel devices

N-gram language models for massively parallel devices N-gram language models for massively parallel devices Nikolay Bogoychev and Adam Lopez University of Edinburgh Edinburgh, United Kingdom Abstract For many applications, the query speed of N-gram language

More information

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40 Lecture 16 Hashing Hash table and hash function design Hash functions for integers and strings Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table

More information

LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING

LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING Joshua Goodman Speech Technology Group Microsoft Research Redmond, Washington 98052, USA joshuago@microsoft.com http://research.microsoft.com/~joshuago

More information

Moses version 2.1 Release Notes

Moses version 2.1 Release Notes Moses version 2.1 Release Notes Overview The document is the release notes for the Moses SMT toolkit, version 2.1. It describes the changes to toolkit since version 1.0 from January 2013. The aim during

More information

HASH TABLES. Goal is to store elements k,v at index i = h k

HASH TABLES. Goal is to store elements k,v at index i = h k CH 9.2 : HASH TABLES 1 ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN C++, GOODRICH, TAMASSIA AND MOUNT (WILEY 2004) AND SLIDES FROM JORY DENNY AND

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS

CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS 0 1 2 025-612-0001 981-101-0002 3 4 451-229-0004 CHAPTER 9 HASH TABLES, MAPS, AND SKIP LISTS ACKNOWLEDGEMENT: THESE SLIDES ARE ADAPTED FROM SLIDES PROVIDED WITH DATA STRUCTURES AND ALGORITHMS IN C++, GOODRICH,

More information

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. CS 224N / Ling 237 Programming Assignment 1: Language Modeling Due

More information

Data Representation and Indexing

Data Representation and Indexing Data Representation and Indexing Kai Shen Data Representation Data structure and storage format that represents, models, or approximates the original information What we ve seen/learned so far? Raw byte

More information

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved. Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open

More information

University of Waterloo CS240 Spring 2018 Help Session Problems

University of Waterloo CS240 Spring 2018 Help Session Problems University of Waterloo CS240 Spring 2018 Help Session Problems Reminder: Final on Wednesday, August 1 2018 Note: This is a sample of problems designed to help prepare for the final exam. These problems

More information

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler Smoothing BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de November 1, 2016 Last Week Language model: P(Xt = wt X1 = w1,...,xt-1 = wt-1)

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

Computer Science 302 Fall 2009 (Practice) Second Examination, October 15, 2009

Computer Science 302 Fall 2009 (Practice) Second Examination, October 15, 2009 omputer Science 0 Fall 009 (Practice) Second xamination, October 1, 009 Name: The entire examination is 0 points. 1. True or False. [ points each] (a) (b) A good programmer should never use linear search.

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 124 Section #8 Hashing, Skip Lists 3/20/17 1 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look

More information

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials) CSE100 Advanced Data Structures Lecture 21 (Based on Paul Kube course materials) CSE 100 Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table cost

More information

Indexing: Overview & Hashing. CS 377: Database Systems

Indexing: Overview & Hashing. CS 377: Database Systems Indexing: Overview & Hashing CS 377: Database Systems Recap: Data Storage Data items Records Memory DBMS Blocks blocks Files Different ways to organize files for better performance Disk Motivation for

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)

More information

iems Interactive Experiment Management System Final Report

iems Interactive Experiment Management System Final Report iems Interactive Experiment Management System Final Report Pēteris Ņikiforovs Introduction Interactive Experiment Management System (Interactive EMS or iems) is an experiment management system with a graphical

More information

Storage hierarchy. Textbook: chapters 11, 12, and 13

Storage hierarchy. Textbook: chapters 11, 12, and 13 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow Very small Small Bigger Very big (KB) (MB) (GB) (TB) Built-in Expensive Cheap Dirt cheap Disks: data is stored on concentric circular

More information

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25

Indexing. Jan Chomicki University at Buffalo. Jan Chomicki () Indexing 1 / 25 Indexing Jan Chomicki University at Buffalo Jan Chomicki () Indexing 1 / 25 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow (nanosec) (10 nanosec) (millisec) (sec) Very small Small

More information

Adding Source Code Searching Capability to Yioop

Adding Source Code Searching Capability to Yioop Adding Source Code Searching Capability to Yioop Advisor - Dr Chris Pollett Committee Members Dr Sami Khuri and Dr Teng Moh Presented by Snigdha Rao Parvatneni AGENDA Introduction Preliminary work Git

More information

CS159 - Assignment 2b

CS159 - Assignment 2b CS159 - Assignment 2b Due: Tuesday, Sept. 23 at 2:45pm For the main part of this assignment we will be constructing a number of smoothed versions of a bigram language model and we will be evaluating its

More information

CS 525: Advanced Database Organization 04: Indexing

CS 525: Advanced Database Organization 04: Indexing CS 5: Advanced Database Organization 04: Indexing Boris Glavic Part 04 Indexing & Hashing value record? value Slides: adapted from a course taught by Hector Garcia-Molina, Stanford InfoLab CS 5 Notes 4

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview

More information

Lecture 18. Collision Resolution

Lecture 18. Collision Resolution Lecture 18 Collision Resolution Introduction In this lesson we will discuss several collision resolution strategies. The key thing in hashing is to find an easy to compute hash function. However, collisions

More information

New Concept based Indexing Technique for Search Engine

New Concept based Indexing Technique for Search Engine Indian Journal of Science and Technology, Vol 10(18), DOI: 10.17485/ijst/2017/v10i18/114018, May 2017 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 New Concept based Indexing Technique for Search

More information

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005. More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005. Outline B-tree Domain of Application B-tree Operations Hash Tables on Disk Hash Table Operations Extensible Hash Tables Multidimensional

More information

Chapter 9: Maps, Dictionaries, Hashing

Chapter 9: Maps, Dictionaries, Hashing Chapter 9: 0 1 025-612-0001 2 981-101-0002 3 4 451-229-0004 Maps, Dictionaries, Hashing Nancy Amato Parasol Lab, Dept. CSE, Texas A&M University Acknowledgement: These slides are adapted from slides provided

More information

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 Dan Jurafsky Tuesday, January 23, 2018 1 Part 1: Group Exercise We are interested in building

More information

arxiv: v1 [cs.ir] 25 Jun 2018

arxiv: v1 [cs.ir] 25 Jun 2018 Handling Massive N -Gram Datasets Efficiently ariv:806.09447v [cs.ir] 25 Jun 208 GIULIO ERMNNO PIBIRI and ROSSNO VENTURINI, University of Pisa and ISTI-NR, Italy bstract. This paper deals with the two

More information

Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition

Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition Karlsruhe Institute of Technology Department of Informatics Institute for Anthropomatics Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition Christian Mandery

More information

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation Announcements HW1 PAST DUE HW2 online: 7 questions, 60 points Nat l Inst visit Thu, ok? Last time: Continued PA1 Walk Through Dictionary ADT: Unsorted Hashing Today: Finish up hashing Sorted Dictionary

More information

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE 15-415 DATABASE APPLICATIONS C. Faloutsos Indexing and Hashing 15-415 Database Applications http://www.cs.cmu.edu/~christos/courses/dbms.s00/ general

More information

CSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators) _ UWNetID: Lecture Section: A CSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will give

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

LoonyBin: Making Empirical MT Reproduciple, Efficient, and Less Annoying

LoonyBin: Making Empirical MT Reproduciple, Efficient, and Less Annoying LoonyBin: Making Empirical MT Reproduciple, Efficient, and Less Annoying Jonathan Clark Alon Lavie CMU Machine Translation Lunch Tuesday, January 19, 2009 Outline A (Brief) Guilt Trip What goes into a

More information

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed) Name: Sample Solution Email address: CSE 373 Autumn 0: Midterm # (closed book, closed notes, NO calculators allowed) Instructions: Read the directions for each question carefully before answering. We may

More information

Fast Evaluation of Union-Intersection Expressions

Fast Evaluation of Union-Intersection Expressions Fast Evaluation of Union-Intersection Expressions Philip Bille Anna Pagh Rasmus Pagh IT University of Copenhagen Data Structures for Intersection Queries Preprocess a collection of sets S 1,..., S m independently

More information

Title. the key value index optimized for size and speed. 1

Title. the key value index optimized for size and speed.   1 Title the key value index optimized for size and speed Title 2 alue ndex based on finite state (FST, immutable data structure) Opensource (Apache 2.0) written in C++(core), Python(binding) keyvi @CLIQZ

More information

Fast Lookup: Hash tables

Fast Lookup: Hash tables CSE 100: HASHING Operations: Find (key based look up) Insert Delete Fast Lookup: Hash tables Consider the 2-sum problem: Given an unsorted array of N integers, find all pairs of elements that sum to a

More information

13 File Structures. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

13 File Structures. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to: 13 File Structures 13.1 Source: Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define two categories of access methods: sequential

More information

Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC) State-of-the-art State-of-the-art ASR system: DNN+HMM Speech (words) Sound Signal Graph

More information

Understand how to deal with collisions

Understand how to deal with collisions Understand the basic structure of a hash table and its associated hash function Understand what makes a good (and a bad) hash function Understand how to deal with collisions Open addressing Separate chaining

More information

CS 216 Fall 2007 Final Exam Page 1 of 10 Name: ID:

CS 216 Fall 2007 Final Exam Page 1 of 10 Name:  ID: Page 1 of 10 Name: Email ID: You MUST write your name and e-mail ID on EACH page and bubble in your userid at the bottom of EACH page including this page. If you do not do this, you will receive a zero

More information

Integer Algorithms and Data Structures

Integer Algorithms and Data Structures Integer Algorithms and Data Structures and why we should care about them Vladimír Čunát Department of Theoretical Computer Science and Mathematical Logic Doctoral Seminar 2010/11 Outline Introduction Motivation

More information

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Last Class Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications Last Class Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#12: External Sorting (R&G, Ch13) Static Hashing Extendible Hashing Linear Hashing Hashing

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

IMPORTANT: Circle the last two letters of your class account:

IMPORTANT: Circle the last two letters of your class account: Spring 2011 University of California, Berkeley College of Engineering Computer Science Division EECS MIDTERM I CS 186 Introduction to Database Systems Prof. Michael J. Franklin NAME: STUDENT ID: IMPORTANT:

More information

Stream-based Statistical Machine Translation. Abby D. Levenberg

Stream-based Statistical Machine Translation. Abby D. Levenberg Stream-based Statistical Machine Translation Abby D. Levenberg Doctor of Philosophy School of Informatics University of Edinburgh 2011 Abstract We investigate a new approach for SMT system training within

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part IV Lecture 14, March 10, 015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part III Tree-based indexes: ISAM and B+ trees Data Warehousing/

More information

Some Practice Problems on Hardware, File Organization and Indexing

Some Practice Problems on Hardware, File Organization and Indexing Some Practice Problems on Hardware, File Organization and Indexing Multiple Choice State if the following statements are true or false. 1. On average, repeated random IO s are as efficient as repeated

More information

Algorithms and Data Structures

Algorithms and Data Structures Lesson 4: Sets, Dictionaries and Hash Tables Luciano Bononi http://www.cs.unibo.it/~bononi/ (slide credits: these slides are a revised version of slides created by Dr. Gabriele D Angelo)

More information

DS ,21. L11-12: Hashmap

DS ,21. L11-12: Hashmap Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences DS286 2016-09-16,21 L11-12: Hashmap Yogesh Simmhan s i m m h a n @ c d s.

More information

How SPICE Language Modeling Works

How SPICE Language Modeling Works How SPICE Language Modeling Works Abstract Enhancement of the Language Model is a first step towards enhancing the performance of an Automatic Speech Recognition system. This report describes an integrated

More information

University of Waterloo CS240R Winter 2018 Help Session Problems

University of Waterloo CS240R Winter 2018 Help Session Problems University of Waterloo CS240R Winter 2018 Help Session Problems Reminder: Final on Monday, April 23 2018 Note: This is a sample of problems designed to help prepare for the final exam. These problems do

More information

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS

STRUKTUR DATA. By : Sri Rezeki Candra Nursari 2 SKS STRUKTUR DATA By : Sri Rezeki Candra Nursari 2 SKS Literatur Sjukani Moh., (2007), Struktur Data (Algoritma & Struktur Data 2) dengan C, C++, Mitra Wacana Media Utami Ema. dkk, (2007), Struktur Data (Konsep

More information

Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models

Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models Lattice Rescoring for Speech Recognition Using Large Scale Distributed Language Models ABSTRACT Euisok Chung Hyung-Bae Jeon Jeon-Gue Park and Yun-Keun Lee Speech Processing Research Team, ETRI, 138 Gajeongno,

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table CS1020 Data Structures and Algorithms I Lecture Note #15 Hashing For efficient look-up in a table Objectives 1 To understand how hashing is used to accelerate table lookup 2 To study the issue of collision

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information

RAID in Practice, Overview of Indexing

RAID in Practice, Overview of Indexing RAID in Practice, Overview of Indexing CS634 Lecture 4, Feb 04 2014 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke 1 Disks and Files: RAID in practice For a big enterprise

More information

ADT asscociafve array INSERT, SEARCH, DELETE. Advanced Algorithmics (6EAP) Some reading MTAT Hashing. Jaak Vilo.

ADT asscociafve array INSERT, SEARCH, DELETE. Advanced Algorithmics (6EAP) Some reading MTAT Hashing. Jaak Vilo. Jaak Vilo Advanced Algorithmics (6EAP) MTAT.03.238 Hashing Jaak Vilo 2013 Spring 1 ADT asscociafve array INSERT, SEARCH, DELETE An associa6ve array (also associa6ve container, map, mapping, dic6onary,

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 6 - Storage and Indexing References Generalized Search Trees for Database Systems. J. M. Hellerstein, J. F. Naughton

More information

Raima Database Manager Version 14.1 In-memory Database Engine

Raima Database Manager Version 14.1 In-memory Database Engine + Raima Database Manager Version 14.1 In-memory Database Engine By Jeffrey R. Parsons, Chief Engineer November 2017 Abstract Raima Database Manager (RDM) v14.1 contains an all new data storage engine optimized

More information

A Distributed Hash Table for Shared Memory

A Distributed Hash Table for Shared Memory A Distributed Hash Table for Shared Memory Wytse Oortwijn Formal Methods and Tools, University of Twente August 31, 2015 Wytse Oortwijn (Formal Methods and Tools, AUniversity Distributed of Twente) Hash

More information

CSC 261/461 Database Systems Lecture 17. Fall 2017

CSC 261/461 Database Systems Lecture 17. Fall 2017 CSC 261/461 Database Systems Lecture 17 Fall 2017 Announcement Quiz 6 Due: Tonight at 11:59 pm Project 1 Milepost 3 Due: Nov 10 Project 2 Part 2 (Optional) Due: Nov 15 The IO Model & External Sorting Today

More information

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators) Name: Email address: Quiz Section: CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will

More information

Computer Science 136 Spring 2004 Professor Bruce. Final Examination May 19, 2004

Computer Science 136 Spring 2004 Professor Bruce. Final Examination May 19, 2004 Computer Science 136 Spring 2004 Professor Bruce Final Examination May 19, 2004 Question Points Score 1 10 2 8 3 15 4 12 5 12 6 8 7 10 TOTAL 65 Your name (Please print) I have neither given nor received

More information

Quiz 1 Practice Problems

Quiz 1 Practice Problems Introduction to Algorithms: 6.006 Massachusetts Institute of Technology March 7, 2008 Professors Srini Devadas and Erik Demaine Handout 6 1 Asymptotic Notation Quiz 1 Practice Problems Decide whether these

More information