KenLM: Faster and Smaller Language Model Queries

Kenneth Heafield (heafield@cs.cmu.edu)
Carnegie Mellon University
July 30, 2011
kheafield.com/code/kenlm

What KenLM Does

Answer language model queries using less time and memory.

  log p(<s> iran)           = -3.33437
  log p(<s> iran is)        = -1.05931
  log p(<s> iran is one)    = -1.80743
  log p(<s> iran is one of) = -0.03705
  log p(iran is one of the) = -0.08317
  log p(is one of the few)  = -1.20788

Related Work

Downloadable baselines:
- SRI: popular and considered fast, but high-memory
- IRST: open source, low-memory, single-threaded
- Rand: low-memory lossy compression
- MIT: mostly estimates models, but also answers queries

Papers without code:
- TPT: better memory locality
- Sheffield: lossy compression techniques

After KenLM's public release:
- Berkeley: Java; slower and larger than KenLM

Why I Wrote KenLM

Decoding takes too long:
- Answer queries quickly
- Load quickly with memory mapping (sketched below)
- Thread-safe

Bigger models:
- Conserve memory

SRI doesn't compile:
- Distribute and compile with decoders
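
The memory-mapping point is what makes loading fast: the binary model is mapped into the process's address space and pages fault in on demand, so nothing is parsed or copied at start-up and the OS can share pages across processes. A minimal sketch of the idea, assuming a hypothetical model.binary file (KenLM's real loader is more involved):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      // Hypothetical binary model file name, for illustration only.
      const char *path = "model.binary";
      int fd = open(path, O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }
      struct stat sb;
      if (fstat(fd, &sb) < 0) { perror("fstat"); return 1; }
      // Map the whole file read-only; pages load lazily on first access.
      void *base = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED) { perror("mmap"); return 1; }
      // The packed data structures are then used in place: no parsing,
      // no copying, and multiple processes can share the same pages.
      munmap(base, sb.st_size);
      close(fd);
      return 0;
    }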

Outline

1. State
2. Probing
3. Trie
4. Benchmarks: Perplexity and Translation

Example Language Model

Unigrams:
  Words  log p  Back
  <s>    -      -2.0
  iran   -4.1   -0.8
  is     -2.5   -1.4
  one    -3.3   -0.9
  of     -2.5   -1.1

Bigrams:
  Words     log p  Back
  <s> iran  -3.3   -1.2
  iran is   -1.7   -0.4
  is one    -2.0   -0.9
  one of    -1.4   -0.6

Trigrams:
  Words        log p
  <s> iran is  -1.1
  iran is one  -2.0
  is one of    -0.3

Example Queries

(Same model as above.)

Query <s> iran is:
  log p(<s> iran is) = -1.1

Query iran is of:
  log p(of)          = -2.5
  Backoff(is)        = -1.4
  Backoff(iran is)   = -0.4
  -------------------------
  log p(iran is of)  = -4.3
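
The second query is the backoff rule in action: since "iran is of" is not in the model, the score is p(of) plus the backoff penalties of the skipped contexts. A toy sketch of that rule, assuming hypothetical string-keyed tables rather than KenLM's packed structures:

    #include <string>
    #include <unordered_map>

    // Toy model: every n-gram, any order, keyed by its space-joined words.
    // (Hypothetical layout for illustration; KenLM stores hashes/tries.)
    struct Entry { float log_prob; float backoff; };
    std::unordered_map<std::string, Entry> table;

    // log p(word | context) with backoff: if "context word" is unknown,
    // charge Backoff(context) and retry with the leftmost word dropped.
    float Score(const std::string &context, const std::string &word) {
      const std::string full = context.empty() ? word : context + " " + word;
      auto hit = table.find(full);
      if (hit != table.end()) return hit->second.log_prob;
      if (context.empty()) return -99.0f;  // out-of-vocabulary floor
      float backoff = 0.0f;                // absent context => backoff 0
      auto ctx = table.find(context);
      if (ctx != table.end()) backoff = ctx->second.backoff;
      const std::string::size_type space = context.find(' ');
      const std::string shorter =
          (space == std::string::npos) ? "" : context.substr(space + 1);
      return backoff + Score(shorter, word);
    }

With the example model loaded, Score("iran is", "of") = Backoff(iran is) + Backoff(is) + log p(of) = -0.4 + -1.4 + -2.5 = -4.3, matching the slide.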

State: Lookups Performed by Queries

Query <s> iran is:
  Lookups: 1. is   2. iran is   3. <s> iran is
  Score: log p(<s> iran is) = -1.1
  State: Backoff(is), Backoff(iran is)

Query iran is of:
  Lookups: 1. of   2. is of (not found)   3. is   4. iran is
  Score: log p(of) -2.5, Backoff(is) -1.4, Backoff(iran is) -0.4
         log p(iran is of) = -4.3

Stateful Query Pattern

State: Backoff(<s>)
log p(<s> iran) = -3.3
State: Backoff(iran), Backoff(<s> iran)
log p(<s> iran is) = -1.1
State: Backoff(is), Backoff(iran is)
log p(iran is one) = -2.0
State: Backoff(one), Backoff(is one)
log p(is one of) = -0.3
State: Backoff(of), Backoff(one of)
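
KenLM's C++ API exposes this pattern directly: each Score call consumes the previous State and fills in the next one, so context lookups and backoff weights are never recomputed. A sketch against the public API (the model path is hypothetical):

    #include <iostream>
    #include "lm/model.hh"

    int main() {
      lm::ngram::Model model("file.arpa");  // hypothetical model path
      const lm::ngram::Vocabulary &vocab = model.GetVocabulary();
      // The begin-of-sentence state already accounts for <s>.
      lm::ngram::State state(model.BeginSentenceState()), out_state;
      const char *sentence[] = {"iran", "is", "one", "of"};
      for (const char *word : sentence) {
        // Score returns log10 p(word | state); out_state carries the
        // context words and backoff weights needed for the next call.
        std::cout << word << " "
                  << model.Score(state, vocab.Index(word), out_state) << "\n";
        state = out_state;
      }
      return 0;
    }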

Data Structures

Probing: fast. Uses hash tables.
Trie: small. Uses sorted arrays.
Trie, smaller: with compressed pointers.

Key Subproblem
Sparse lookup: efficiently retrieve values for sparse keys.

Probing: Sparse Lookup Speed

[Chart: lookups/µs (1 to 100, log scale) vs. number of entries (10 to 10^7, log scale). Curves, roughly from fastest to slowest: probing, hash_set, unordered, interpolation search, binary search, set.]

Probing: Linear Probing Hash Table

Store 64-bit hashes and ignore collisions.

Bigrams:
  Words     Hash                log p  Back
  <s> iran  0xf0ae9c2442c6920e  -3.3   -1.2
  iran is   0x959e48455f4a2e90  -1.7   -0.4
  is one    0x186a7caef34acf16  -2.0   -0.9
  one of    0xac66610314db8dac  -1.4   -0.6

Probing: Linear Probing Hash Table

1.5 buckets/entry (so buckets = 6).
Ideal bucket = hash mod buckets.
Resolve bucket collisions using the next free bucket.

Array:
  Bucket  Words     Ideal  Hash                log p  Back
  0       iran is   0      0x959e48455f4a2e90  -1.7   -0.4
  1       (empty)          0x0                 0      0
  2       is one    2      0x186a7caef34acf16  -2.0   -0.9
  3       one of    2      0xac66610314db8dac  -1.4   -0.6
  4       <s> iran  4      0xf0ae9c2442c6920e  -3.3   -1.2
  5       (empty)          0x0                 0      0
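
A minimal sketch of this structure: buckets hold a 64-bit hash plus the two floats, collisions between full 64-bit hashes are silently ignored, and bucket collisions are resolved by scanning to the next free bucket. Illustrative only; KenLM's production version lives in util/probing_hash_table.hh.

    #include <cstdint>
    #include <vector>

    struct Bucket { uint64_t hash; float log_prob; float backoff; };

    class ProbingTable {
     public:
      // 1.5 buckets per entry, as on the slide; hash 0 marks an empty
      // bucket (the vector is value-initialized to all zeros).
      explicit ProbingTable(size_t entries) : buckets_(entries * 3 / 2) {}

      void Insert(uint64_t hash, float log_prob, float backoff) {
        size_t i = hash % buckets_.size();       // ideal bucket
        while (buckets_[i].hash != 0)            // next free bucket
          i = (i + 1) % buckets_.size();
        buckets_[i] = Bucket{hash, log_prob, backoff};
      }

      // Returns nullptr if absent.  Two n-grams with equal 64-bit hashes
      // would collide silently: the table stores hashes, not the words.
      const Bucket *Find(uint64_t hash) const {
        for (size_t i = hash % buckets_.size(); buckets_[i].hash != 0;
             i = (i + 1) % buckets_.size()) {
          if (buckets_[i].hash == hash) return &buckets_[i];
        }
        return nullptr;
      }

     private:
      std::vector<Bucket> buckets_;
    };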

Probing Data Structure

Unigrams (array):
  Words  log p  Back
  <s>    -      -2.0
  iran   -4.1   -0.8
  is     -2.5   -1.4
  one    -3.3   -0.9
  of     -2.5   -1.1

Bigrams (probing hash table):
  Words     log p  Back
  <s> iran  -3.3   -1.2
  iran is   -1.7   -0.4
  is one    -2.0   -0.9
  one of    -1.4   -0.6

Trigrams (probing hash table):
  Words        log p
  <s> iran is  -1.1
  iran is one  -2.0
  is one of    -0.3

Probing Hash Table Summary

Hash tables are fast, but memory is 24 bytes/entry: each bucket holds an 8-byte hash and two 4-byte floats, and at 1.5 buckets per entry that is 16 * 1.5 = 24 bytes.

Next: saving memory with the Trie.

Trie: Uses Sorted Arrays

Sort in suffix order. (The example model is extended here with additional n-grams.)

Unigrams:
  Words  log p  Back  Ptr
  <s>    -      -2.0
  iran   -4.1   -0.8
  is     -2.5   -1.4
  one    -3.3   -0.9
  of     -2.5   -1.1

Bigrams:
  Words     log p  Back  Ptr
  <s> iran  -3.3   -1.2
  iran is   -1.7   -0.4
  one is    -2.3   -0.3
  <s> one   -2.3   -1.1
  is one    -2.0   -0.9
  one of    -1.4   -0.6

Trigrams:
  Words        log p
  <s> iran is  -1.1
  <s> one is   -2.3
  iran is one  -2.0
  <s> one of   -0.5
  is one of    -0.3

Trie: Encode Suffixes with Pointers

Sort in suffix order. Encode each suffix using pointers.

Unigrams (array):
  Words  log p  Back  Ptr
  <s>    -      -2.0  0
  iran   -4.1   -0.8  0
  is     -2.5   -1.4  1
  one    -3.3   -0.9  4
  of     -2.5   -1.1  6
  (end)               7

Bigrams (array):
  Words     log p  Back  Ptr
  <s> iran  -3.3   -1.2  0
  <s> is    -2.9   -1.0  0
  iran is   -1.7   -0.4  0
  one is    -2.3   -0.3  1
  <s> one   -2.3   -1.1  2
  is one    -2.0   -0.9  2
  one of    -1.4   -0.6  3
  (end)                  5

Trigrams (array):
  Words        log p
  <s> iran is  -1.1
  <s> one is   -2.3
  iran is one  -2.0
  <s> one of   -0.5
  is one of    -0.3

Trie: Interpolation Search

Each trie node is a sorted array.

Bigrams ending in "is":
  Words    log p  Back  Ptr
  <s> is   -2.9   -1.0  0
  iran is  -1.7   -0.4  0
  one is   -2.3   -0.3  1

Interpolation search: O(log log n)
  pivot = |A| * (key - A.first) / (A.last - A.first)

Binary search: O(log n)
  pivot = |A| / 2
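
The pivot formula in code: rather than bisecting, interpolation search guesses where the key would lie if keys were spread uniformly over [A.first, A.last], giving O(log log n) expected probes on near-uniform keys such as hashes or word indices. A generic sketch:

    #include <cstdint>
    #include <cstddef>

    // Interpolation search over a sorted array of roughly uniform keys.
    // Returns the index of `key`, or -1 if absent.
    long InterpolationSearch(const uint64_t *a, size_t n, uint64_t key) {
      if (n == 0) return -1;
      size_t lo = 0, hi = n - 1;
      while (a[lo] <= key && key <= a[hi]) {
        if (a[hi] == a[lo]) return (a[lo] == key) ? (long)lo : -1;
        // pivot = lo + (hi - lo) * (key - a[lo]) / (a[hi] - a[lo])
        size_t pivot = lo + (size_t)((double)(hi - lo) *
            (double)(key - a[lo]) / (double)(a[hi] - a[lo]));
        if (a[pivot] < key) {
          lo = pivot + 1;   // pivot < hi in this branch, so lo stays valid
        } else if (a[pivot] > key) {
          hi = pivot - 1;   // pivot > lo in this branch, so hi stays valid
        } else {
          return (long)pivot;
        }
      }
      return -1;
    }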

Trie: Saving Memory with Bit-Level Packing

Store the word index and pointer using the minimum number of bits.

Optional quantization: cluster floats into 2^q bins and store q bits per float (same as IRSTLM).
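
In miniature, bit-level packing means writing fields at arbitrary bit offsets instead of rounding up to whole bytes or words: a 100k-word vocabulary needs only 17 bits per word index, not 32. A sketch of the idea (bit-by-bit for clarity; real implementations read whole machine words at a time):

    #include <cstdint>

    // Write the `bits` low-order bits of `value` at bit offset `bit`.
    void WriteBits(uint8_t *base, uint64_t bit, uint8_t bits, uint64_t value) {
      for (uint8_t i = 0; i < bits; ++i, ++bit) {
        uint64_t byte = bit >> 3, shift = bit & 7;
        base[byte] = (uint8_t)((base[byte] & ~(1u << shift)) |
                               (((value >> i) & 1u) << shift));
      }
    }

    uint64_t ReadBits(const uint8_t *base, uint64_t bit, uint8_t bits) {
      uint64_t value = 0;
      for (uint8_t i = 0; i < bits; ++i, ++bit)
        value |= (uint64_t)((base[bit >> 3] >> (bit & 7)) & 1u) << i;
      return value;
    }

    // Example: the k-th 17-bit word index lives at bit offset 17 * k in a
    // byte array of size (17 * n_entries + 7) / 8.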

Trie: Compress Pointers

Bigrams:
  Words     log p  Back  Ptr
  <s> iran  -3.3   -1.2  0
  iran is   -1.7   -0.4  0
  one is    -2.3   -0.3  1
  <s> one   -2.3   -1.1  2
  is one    -2.0   -0.9  2
  one of    -1.4   -0.6  3
  (end)                  5

The pointer column is sorted (nondecreasing).

Trie: Compress Pointers (Raj and Whittaker, 2003)

  Offset  Ptr  Binary
  0       0    000
  1       0    000
  2       1    001
  3       2    010
  4       2    010
  5       3    011
  6       5    101

Because the pointers are sorted, their leading bits can be chopped off and replaced by a small table recording the first offset at which each chopped prefix appears:

  Chopped  First offset
  01       3
  10       6
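
A sketch of how the chopped table can be consulted at query time, under the assumption (as on the slide) that pointers are sorted: the main array keeps only the low bits, and the prefix for position i is the last entry of the first-offset table that starts at or before i. Field names here are illustrative, not KenLM's:

    #include <cstdint>
    #include <vector>
    #include <algorithm>

    struct CompressedPointers {
      std::vector<uint32_t> low;           // low `low_bits` bits per entry
      std::vector<uint64_t> first_offset;  // first_offset[p] = first index
                                           // whose chopped prefix is p; one
                                           // entry per prefix value 0..max
                                           // (unused prefixes repeat the
                                           // next occupied offset)
      uint8_t low_bits;

      uint64_t Get(uint64_t i) const {
        // Largest prefix whose first offset is <= i.
        uint64_t prefix = (uint64_t)(std::upper_bound(
            first_offset.begin(), first_offset.end(), i) -
            first_offset.begin()) - 1;
        return (prefix << low_bits) | low[i];
      }
    };

On the slide's data with a one-bit chop (low_bits = 2, first_offset = {0, 6}), Get(6) recovers prefix 1 and low bits 01, i.e. pointer 5.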

Trie Summary

Save memory: bit packing, quantization, and pointer compression.

Outline

1. State
2. Probing
3. Trie
4. Benchmarks: Perplexity and Translation

Perplexity Task

Score the English Gigaword corpus.

Model: SRILM 5-gram from Europarl + de-duplicated News Crawl.

Measurements:
- Queries/ms (excludes loading and file reading time)
- Loaded memory (resident after loading)
- Peak memory (peak virtual after scoring)

Perplexity Task: Exact Models

[Chart: queries/ms (0 to 2000) vs. memory in GB (0 to 10), showing loaded and peak memory per system. Systems: Ken Probing (fastest, near 2000 queries/ms), Ken Trie, SRI, IRST Inverted, IRST, SRI Compact, MIT.]

Perplexity Task: Berkeley (Always Quantizes to 19 Bits)

[Chart: queries/ms (0 to 2000) vs. memory in GB (0 to 10). Adds Berkeley Compress (19-bit), Berkeley Scroll (19-bit), and Berkeley Hash (19-bit) to the previous systems; Ken Probing remains fastest.]

Perplexity Task: RandLM from an ARPA File

[Chart: queries/ms (0 to 2000) vs. memory in GB (0 to 10). Adds 8-bit quantized Ken Trie and Rand Backoff (8-bit, false-positive probability p(false) = 2^-8) to the previous systems.]

Translation Task

Translate 3003 sentences using Moses.

System: WMT 2011 French-English baseline, Europarl + News LM.

Measurements:
- Time: total wall time, including loading
- Memory: total resident memory after decoding

Moses Benchmarks: 8 Threads

[Chart: time in hours vs. memory in GB (0 to 12). Systems: Ken Probing, 8-bit Ken Trie, SRI, and Rand Backoff (8-bit, p(false) = 2^-8).]

Moses Benchmarks: Single Threaded

[Chart: time in hours (0 to 4) vs. memory in GB (0 to 12). Systems: Ken Probing, 8-bit Ken Trie, IRST, SRI, and Rand Backoff (8-bit, p(false) = 2^-8).]

Comparison to RandLM (Unpruned Model, One Thread)

[Chart: time in hours (0 to 6) vs. memory in GB (0 to 12), with BLEU next to each point. Rand Backoff: 25.89 BLEU at p(false) = 2^-8, 26.67 at p(false) = 2^-10. Rand Stupid Backoff: 26.69 at 2^-8, 26.87 at 2^-10. KenLM points, including 4- and 8-bit quantized tries, score 27.09 to 27.24 BLEU.]

Conclusion

Maximize speed and accuracy subject to memory.

Probing > Trie > 8-bit Trie > RandLM Stupid for both speed and memory.

Distributed with decoders:
- Moses: LM feature line "8 0 5 file"
- cdec: KLanguageModel
- Joshua: use kenlm=true

kheafield.com/code/kenlm/