N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh

Size: px
Start display at page:

Download "N-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh"

Transcription

1 N-Gram Language Modelling including Feed-Forward NNs Kenneth Heafield University of Edinburgh

2 History of Language Model History Kenneth Heafield University of Edinburgh

3 3

4 p(type Predictive) > p(tyler Predictive) 4

5 p(lose Win or) p(loose Win or) [Church et al, 2007] 5

6 Heated indoor swimming pool 6

7 Chambre Bedroom 7

8 présidente de la Chambre des représentants chairwoman of the Bedroom of Representatives 8

9 présidente de la Chambre des représentants chairwoman of the House of Representatives 9

10 présidente de la Chambre des représentants chairwoman of the House of Representatives p(chairwoman of the House of Representatives) > p(chairwoman of the Bedroom of Representatives) 10

11 p(another one bites the dust.) > p(another one rides the bus.) 11

12 Prediction Spelling Translation Speech Essential Component: Language Model p(moses Compiles) =? 12

13 Sequence Models Chain Rule p(moses compiles)=p(moses)p(compiles Moses) 13

14 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )=

15 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )= Explicit begin and end of sentence. 15

16 Sequence Model log p(iran <s> )= log p(is <s> iran )= log p(one <s> iran is )= log p(of <s> iran is one )= log p(the <s> iran is one of )= log p(few <s> iran is one of the )= log p(countries <s> iran is one of the few )= log p(. <s> iran is one of the few countries )= log p(</s> <s> iran is one of the few countries.)= = log p(<s> iran is one of the few countries. </s> )= Where do these probabilities come from? 16

17 Probabilities from Text Model p(raw in the) 17

18 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the)

19 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the) 1 4 p(ugrasena in the) 0 19

20 Estimating from Text help in the search for an answer. Copper burned in the raw wood. put forward in the paper Highs in the 50s to lower 60s.. = p(raw in the) 1 6 p(ugrasena in the)

21 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = count(in the) = 0? 21

22 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = = 0? count(in the) count(the Ugrasena) = = count(the) 22

23 Problem in the Ugrasena was not seen, but could happen. count(in the Ugrasena) p(ugrasena in the) = = 0? count(in the) count(the Ugrasena) = = count(the) Stupid Backoff: Drop context until count is non-zero [Brants et al, 2007] Can we be less stupid? 23

24 Smoothing in the Ugrasena was not seen, but could happen. 1 Backoff: maybe the Ugrasena was seen? 2 Neural Networks: classifier predicts next word 24

25 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) p(ugrasena the) 25

26 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) p(ugrasena the) the Ugrasena was not seen try Ugrasena p(ugrasena the) p(ugrasena) 26

27 Backoff Smoothing in the Ugrasena was not seen try the Ugrasena p(ugrasena in the) = p(ugrasena the)b(in the) the Ugrasena was not seen try Ugrasena p(ugrasena the) = p(ugrasena)b(the) Backoff b is a penalty for not matching context. 27

28 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

29 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Query log p(is <s> iran) = 1.1 Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

30 Backoff Language Model Unigrams Words log p log b <s> 2.0 iran is one of Bigrams Words log p log b <s> iran iran is is one one of Query : p(of iran is) log p(of) 2.5 log b(is) 1.4 log b(iran is) log p(of iran is) = 4.3 Trigrams Words log p <s> iran is 1.1 iran is one 2.0 is one of

31 Close words matter more. Though long-distance matters: Grammatical structure Topical coherence Words tend to repeat Cross-sentence dependencies 31

32 Where do p and b come from? Text! Kneser-Ney Witten-Bell Good-Turing 32

33 Kneser-Ney Common high-quality smoothing 1 Adjust 2 Normalize 3 Interpolate 33

34 Adjusted counts are: Trigrams Count in the text. Others Number of unique words to the left. Lower orders are used when a trigram did not match. How freely does the text associate with new words? 34

35 Adjusted counts are: Trigrams Count in the text. Others Number of unique words to the left. Lower orders are used when a trigram did not match. How freely does the text associate with new words? Input Trigam Count are one of 1 is one of 5 are two of 3 Output 1-gram Adjusted of 2 Output 2-gram Adjusted one of 2 two of 1 35

36 Discounting and Normalization Save mass for unseen events pseudo(w n w n 1 1 ) = adjusted(w 1 n ) discount n (adjusted(w1 n )) n 1 adjusted(w1 x) x Normalize 36

37 Discounting and Normalization Save mass for unseen events pseudo(w n w n 1 1 ) = adjusted(w 1 n ) discount n (adjusted(w1 n )) n 1 adjusted(w1 x) x Normalize Input 3-gram Adjusted are one of 1 are one that 2 is one of 5 Output 3-gram Pseudo are one of 0.26 are one that 0.47 is one of

38 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution. 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary 38

39 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution, 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary Interpolate bigrams with unigrams, etc. p(of one) = pseudo(of one) + backoff(one)p(of) 39

40 Interpolate: Sparsity vs. Specificity Interpolate unigrams with the uniform distribution, 1 p(of) = pseudo(of) + backoff(ɛ) vocabulary Interpolate bigrams with unigrams, etc. p(of one) = pseudo(of one) + backoff(one)p(of) Input n-gram pseudo interpolation weight of 0.1 backoff( ɛ) = 0.1 one of 0.2 backoff( one) = 0.3 are one of 0.4 backoff(are one) = 0.2 Output n-gram p of one of are one of

41 Kneser-Ney Intuition Adjust Measure association with new words. Normalize Leave space for unseen events. Interpolate Handle sparsity. How do we implement it? 41

42 LM queries often account for more than 50% of the CPU [Green et al, WMT 2014] 500 billion entries in my largest model Need speed and memory efficiency 42

43 Counting n-grams <s> Australia is one of the few 5-gram Count <s> Australia is one of 1 Australia is one of the 1 is one of the few 1 Hash table? 43

44 Counting n-grams <s> Australia is one of the few 5-gram Count <s> Australia is one of 1 Australia is one of the 1 is one of the few 1 Hash table? Runs out of RAM. 44

45 Spill to Disk When RAM Runs Out Text Hash Table File 45

46 Split Data Text Hash Table Hash Table File File 46

47 Split and Merge Text Hash Table Hash Table Sort Sort File File Merge Sort 47

48 Training Problem: Batch process large number of records. Solution: Split/merge Stupid backoff in one pass Kneser-Ney in three passes 48

49 Training Problem: Batch process large number of records. Solution: Split/merge Stupid backoff in one pass Kneser-Ney in three passes Training is designed for mutable batch access. What about queries? 49

50 Query stupid(w n w n 1 1 ) = count(w1 n ) count(w n 1 if count(w1 n ) > 0 1 ) 0.4stupid(w n w n 1 2 ) if count(w1 n ) = 0 stupid(few is one of the) count(is one of the few) = 5 count(is one of the) = 12 50

51 Query stupid(w n w n 1 1 ) = count(w1 n ) count(w n 1 if count(w1 n ) > 0 1 ) 0.4stupid(w n w n 1 2 ) if count(w1 n ) = 0 stupid(periwinkle is one of the) count(is one of the periwinkle) = 0 count(one of the periwinkle) = 0 count(of the periwinkle) = 0 count(the periwinkle) = 3 count(the) =

52 Save Memory: Forget Keys Giant hash table with n-grams as keys and counts as values. Replace the n-grams with 64-bit hashes: Store hash(is one of) instead of is one of. Ignore collisions. 52

53 Save Memory: Forget Keys Giant hash table with n-grams as keys and counts as values. Replace the n-grams with 64-bit hashes: Store hash(is one of) instead of is one of. Ignore collisions. Birthday attack: 2 64 = = Low chance of collision until 4 billion entries. 53

54 Default Hash Table boost::unordered_map and gnu_cxx::hash_map n-grams Bucket array 54

55 Default Hash Table boost::unordered_map and gnu_cxx::hash_map n-grams Bucket array Lookup requires two random memory accesses. 55

56 Linear Probing Hash Table 1.5 buckets/entry (so buckets = 6). Ideal bucket = hash mod buckets. Resolve bucket collisions using the next free bucket. Bigrams Words Ideal Hash Count iran is 0 0x959e48455f4a2e90 3 0x0 0 is one 2 0x186a7caef34acf16 5 one of 2 0xac db8dac 2 <s> iran 4 0xf0ae9c2442c6920e 1 0x0 0 56

57 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. 57

58 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. Entries not in the model get assigned offsets Store a fingerprint of each n-gram 58

59 Minimal Perfect Hash Table Maps every n-gram to a unique integer [0, n grams ) Use these as array offsets. Low memory, but potential for false positives 59

60 Less Memory: Sorted Array Look up zebra in a dictionary. Binary search Open in the middle. O(log n) time. Interpolation search Open near the end. O(log log n) time. 60

61 100 Lookups/µs 10 1 probing hash_set unordered_set interpolation binary_search set Entries 61

62 Trie Reverse n-grams, arrange in a trie. is of one are <s> is Australia <s> is Australia one are are Australia <s> 62

63 Saving More RAM Quantization: store approximate values Collapse probability and backoff 63

64 Implementation Summary Implementation involves sparse mapping Hash table Probing hash table Minimal perfect hash table Sorted array with binary or interpolation search 64

65 Problem with Backoff: in the Ugrasena kingdom Ugrasena is unseen use unigram to predict kingdom p(kingdom in the Ugrasena) = p(kingdom) One rare word of context breaks conditioning. 65

66 More Conditioning Word class/cluster model Skip-gram: skip one or more context words Neural networks 66

67 More Conditioning Word class/cluster model Skip-gram: skip one or more context words Neural networks 67

68 Neural N-grams (aka Feed-Forward) Language modeling as classification: Input in the Ugrasena Output kingdom Predict (classify) the next word given preceding words. 68

69 Turning Words into Vectors in kingdom Ugrasena the Assign each word a unique row 69

70 Query: Concatenate Vectors in the Ugrasena Neural Network Prediction 70

71 Issue: Large Vocabulary The input has vocab history dimensions. = The input matrix has vocab history hidden parameters. Say vocab 1 billion. About 1 trillion parameters. 71

72 Dealing with Vocabulary Size Word vectors Map unpopular words to unknown or part of speech Don t use words: characters or short snippets 72

73 Turning Words into Vectors in kingdom Ugrasena the Vectors from the input matrix... or your favorite ACL paper. 73

74 Use Word Vectors Steal word representation from another task (language modeling, IR, SVD). Reduce from vocab one-hot to 100 dimensional dense input. Pro Other task might have more data Con Does not jointly learn word representation 74

75 Output vocabulary Output is still vocab -way classification. = Large output matrix, lots to normalize over. = Slow evaluation, overfitting. 75

76 Dealing with Vocabulary Size Word vectors Map unpopular words to unknown or part of speech Don t use words: characters or short snippets 76

77 Translation Modeling p(bedroom of the American, de la Chambre des représentants des États-Unis) [Devin et al 2014] 77

78 Conclusion Language models measure fluency. Neural networks and backoff are the dominant formalisms.... But you should probably use recurrent networks. 78

KenLM: Faster and Smaller Language Model Queries

KenLM: Faster and Smaller Language Model Queries KenLM: Faster and Smaller Language Model Queries Kenneth heafield@cs.cmu.edu Carnegie Mellon July 30, 2011 kheafield.com/code/kenlm What KenLM Does Answer language model queries using less time and memory.

More information

A Neuro Probabilistic Language Model Bengio et. al. 2003

A Neuro Probabilistic Language Model Bengio et. al. 2003 A Neuro Probabilistic Language Model Bengio et. al. 2003 Class Discussion Notes Scribe: Olivia Winn February 1, 2016 Opening thoughts (or why this paper is interesting): Word embeddings currently have

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

Natural Language Processing

Natural Language Processing Natural Language Processing N-grams and minimal edit distance Pieter Wellens 2012-2013 These slides are based on the course materials from the ANLP course given at the School of Informatics, Edinburgh

More information

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Smoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler Smoothing BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de November 1, 2016 Last Week Language model: P(Xt = wt X1 = w1,...,xt-1 = wt-1)

More information

CS159 - Assignment 2b

CS159 - Assignment 2b CS159 - Assignment 2b Due: Tuesday, Sept. 23 at 2:45pm For the main part of this assignment we will be constructing a number of smoothed versions of a bigram language model and we will be evaluating its

More information

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project after today s lecture Get familiar with bit-twiddling

More information

Natural Language Processing with Deep Learning CS224N/Ling284

Natural Language Processing with Deep Learning CS224N/Ling284 Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview

More information

Log- linear models. Natural Language Processing: Lecture Kairit Sirts

Log- linear models. Natural Language Processing: Lecture Kairit Sirts Log- linear models Natural Language Processing: Lecture 3 21.09.2017 Kairit Sirts The goal of today s lecture Introduce the log- linear/maximum entropy model Explain the model components: features, parameters,

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The

More information

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018

CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 Dan Jurafsky Tuesday, January 23, 2018 1 Part 1: Group Exercise We are interested in building

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum Table ADT and Sorting Algorithm topics continuing (or reviewing?) CS 24 curriculum A table ADT (a.k.a. Dictionary, Map) Table public interface: // Put information in the table, and a unique key to identify

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)"

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3) CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Language Model" Unigram language

More information

Understand how to deal with collisions

Understand how to deal with collisions Understand the basic structure of a hash table and its associated hash function Understand what makes a good (and a bad) hash function Understand how to deal with collisions Open addressing Separate chaining

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

Data Structures And Algorithms

Data Structures And Algorithms Data Structures And Algorithms Hashing Eng. Anis Nazer First Semester 2017-2018 Searching Search: find if a key exists in a given set Searching algorithms: linear (sequential) search binary search Search

More information

A Hybrid Neural Model for Type Classification of Entity Mentions

A Hybrid Neural Model for Type Classification of Entity Mentions A Hybrid Neural Model for Type Classification of Entity Mentions Motivation Types group entities to categories Entity types are important for various NLP tasks Our task: predict an entity mention s type

More information

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo

Module 5: Hashing. CS Data Structures and Data Management. Reza Dorrigiv, Daniel Roche. School of Computer Science, University of Waterloo Module 5: Hashing CS 240 - Data Structures and Data Management Reza Dorrigiv, Daniel Roche School of Computer Science, University of Waterloo Winter 2010 Reza Dorrigiv, Daniel Roche (CS, UW) CS240 - Module

More information

Hash-Based Indexing 1

Hash-Based Indexing 1 Hash-Based Indexing 1 Tree Indexing Summary Static and dynamic data structures ISAM and B+ trees Speed up both range and equality searches B+ trees very widely used in practice ISAM trees can be useful

More information

Coding for Random Projects

Coding for Random Projects Coding for Random Projects CS 584: Big Data Analytics Material adapted from Li s talk at ICML 2014 (http://techtalks.tv/talks/coding-for-random-projections/61085/) Random Projections for High-Dimensional

More information

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018

CSE 332: Data Structures & Parallelism Lecture 10:Hashing. Ruth Anderson Autumn 2018 CSE 332: Data Structures & Parallelism Lecture 10:Hashing Ruth Anderson Autumn 2018 Today Dictionaries Hashing 10/19/2018 2 Motivating Hash Tables For dictionary with n key/value pairs insert find delete

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 124 Section #8 Hashing, Skip Lists 3/20/17 1 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look

More information

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management

Symbol Table. Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management Hashing Symbol Table Symbol table is used widely in many applications. dictionary is a kind of symbol table data dictionary is database management In general, the following operations are performed on

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

Efficient Data Structures for Massive N-Gram Datasets

Efficient Data Structures for Massive N-Gram Datasets Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.it Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy

More information

EECS 496 Statistical Language Models. Winter 2018

EECS 496 Statistical Language Models. Winter 2018 EECS 496 Statistical Language Models Winter 2018 Introductions Professor: Doug Downey Course web site: www.cs.northwestern.edu/~ddowney/courses/496_winter2018 (linked off prof. home page) Logistics Grading

More information

Hash Tables. Hash Tables

Hash Tables. Hash Tables Hash Tables Hash Tables Insanely useful One of the most useful and used data structures that you ll encounter They do not support many operations But they are amazing at the operations they do support

More information

Hash-Based Indexes. Chapter 11

Hash-Based Indexes. Chapter 11 Hash-Based Indexes Chapter 11 1 Introduction : Hash-based Indexes Best for equality selections. Cannot support range searches. Static and dynamic hashing techniques exist: Trade-offs similar to ISAM vs.

More information

Open Addressing: Linear Probing (cont.)

Open Addressing: Linear Probing (cont.) Open Addressing: Linear Probing (cont.) Cons of Linear Probing () more complex insert, find, remove methods () primary clustering phenomenon items tend to cluster together in the bucket array, as clustering

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Hashed-Based Indexing

Hashed-Based Indexing Topics Hashed-Based Indexing Linda Wu Static hashing Dynamic hashing Extendible Hashing Linear Hashing (CMPT 54 4-) Chapter CMPT 54 4- Static Hashing An index consists of buckets 0 ~ N-1 A bucket consists

More information

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009

CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009 CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP Jimmy Lin and Nitin Madnani University of Maryland Wednesday, November 18, 2009 Three Pillars of Statistical NLP Algorithms

More information

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall

Advanced Algorithmics (6EAP) MTAT Hashing. Jaak Vilo 2016 Fall Advanced Algorithmics (6EAP) MTAT.03.238 Hashing Jaak Vilo 2016 Fall Jaak Vilo 1 ADT asscociative array INSERT, SEARCH, DELETE An associative array (also associative container, map, mapping, dictionary,

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval 1 Outline Dictionaries Wildcard queries skip Edit distance skip Spelling correction skip Soundex 2 Inverted index Our

More information

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS

CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE DATABASE APPLICATIONS CARNEGIE MELLON UNIVERSITY DEPT. OF COMPUTER SCIENCE 15-415 DATABASE APPLICATIONS C. Faloutsos Indexing and Hashing 15-415 Database Applications http://www.cs.cmu.edu/~christos/courses/dbms.s00/ general

More information

Fast Lookup: Hash tables

Fast Lookup: Hash tables CSE 100: HASHING Operations: Find (key based look up) Insert Delete Fast Lookup: Hash tables Consider the 2-sum problem: Given an unsorted array of N integers, find all pairs of elements that sum to a

More information

Introduction to Hashing

Introduction to Hashing Lecture 11 Hashing Introduction to Hashing We have learned that the run-time of the most efficient search in a sorted list can be performed in order O(lg 2 n) and that the most efficient sort by key comparison

More information

A Simple (?) Exercise: Predicting the Next Word

A Simple (?) Exercise: Predicting the Next Word CS11-747 Neural Networks for NLP A Simple (?) Exercise: Predicting the Next Word Graham Neubig Site https://phontron.com/class/nn4nlp2017/ Are These Sentences OK? Jane went to the store. store to Jane

More information

Bloom Filter and Lossy Dictionary Based Language Models

Bloom Filter and Lossy Dictionary Based Language Models Bloom Filter and Lossy Dictionary Based Language Models Abby D. Levenberg E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Cognitive Science and Natural Language Processing School of Informatics

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1

Hash Open Indexing. Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Hash Open Indexing Data Structures and Algorithms CSE 373 SP 18 - KASEY CHAMPION 1 Warm Up Consider a StringDictionary using separate chaining with an internal capacity of 10. Assume our buckets are implemented

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1

1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 Asymptotics, Recurrence and Basic Algorithms 1. [1 pt] What is the solution to the recurrence T(n) = 2T(n-1) + 1, T(1) = 1 2. O(n) 2. [1 pt] What is the solution to the recurrence T(n) = T(n/2) + n, T(1)

More information

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms

Hashing. 1. Introduction. 2. Direct-address tables. CmSc 250 Introduction to Algorithms Hashing CmSc 250 Introduction to Algorithms 1. Introduction Hashing is a method of storing elements in a table in a way that reduces the time for search. Elements are assumed to be records with several

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

Logistic Regression: Probabilistic Interpretation

Logistic Regression: Probabilistic Interpretation Logistic Regression: Probabilistic Interpretation Approximate 0/1 Loss Logistic Regression Adaboost (z) SVM Solution: Approximate 0/1 loss with convex loss ( surrogate loss) 0-1 z = y w x SVM (hinge),

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 15, March 15, 2015 Mohammad Hammoud Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+ Tree) and Hash-based (i.e., Extendible

More information

DATA STRUCTURES/UNIT 3

DATA STRUCTURES/UNIT 3 UNIT III SORTING AND SEARCHING 9 General Background Exchange sorts Selection and Tree Sorting Insertion Sorts Merge and Radix Sorts Basic Search Techniques Tree Searching General Search Trees- Hashing.

More information

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index Text Technologies for Data Science INFR11145 Indexing Instructor: Walid Magdy 03-Oct-2017 Lecture Objectives Learn about and implement Boolean search Inverted index Positional index 2 1 Indexing Process

More information

B+ Tree Review. CSE332: Data Abstractions Lecture 10: More B Trees; Hashing. Can do a little better with insert. Adoption for insert

B+ Tree Review. CSE332: Data Abstractions Lecture 10: More B Trees; Hashing. Can do a little better with insert. Adoption for insert B+ Tree Review CSE2: Data Abstractions Lecture 10: More B Trees; Hashing Dan Grossman Spring 2010 M-ary tree with room for L data items at each leaf Order property: Subtree between keys x and y contains

More information

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing.

Module 3: Hashing Lecture 9: Static and Dynamic Hashing. The Lecture Contains: Static hashing. Hashing. Dynamic hashing. Extendible hashing. The Lecture Contains: Hashing Dynamic hashing Extendible hashing Insertion file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture9/9_1.htm[6/14/2012

More information

Lign/CSE 256, Programming Assignment 1: Language Models

Lign/CSE 256, Programming Assignment 1: Language Models Lign/CSE 256, Programming Assignment 1: Language Models 16 January 2008 due 1 Feb 2008 1 Preliminaries First, make sure you can access the course materials. 1 The components are: ˆ code1.zip: the Java

More information

Sparse Non-negative Matrix Language Modeling

Sparse Non-negative Matrix Language Modeling Sparse Non-negative Matrix Language Modeling Joris Pelemans Noam Shazeer Ciprian Chelba joris@pelemans.be noam@google.com ciprianchelba@google.com 1 Outline Motivation Sparse Non-negative Matrix Language

More information

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Hash-Based Indexes. Chapter 11. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Hash-Based Indexes Chapter Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Introduction As for any index, 3 alternatives for data entries k*: Data record with key value k

More information

Data Structures and Algorithms

Data Structures and Algorithms Data Structures and Algorithms CS245-2008S-19 B-Trees David Galles Department of Computer Science University of San Francisco 19-0: Indexing Operations: Add an element Remove an element Find an element,

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

UNIT 6A Organizing Data: Lists Principles of Computing, Carnegie Mellon University

UNIT 6A Organizing Data: Lists Principles of Computing, Carnegie Mellon University UNIT 6A Organizing Data: Lists Carnegie Mellon University 1 Data Structure The organization of data is a very important issue for computation. A data structure is a way of storing data in a computer so

More information

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path.

Selection Queries. to answer a selection query (ssn=10) needs to traverse a full path. Hashing B+-tree is perfect, but... Selection Queries to answer a selection query (ssn=) needs to traverse a full path. In practice, 3-4 block accesses (depending on the height of the tree, buffering) Any

More information

Data and File Structures Chapter 11. Hashing

Data and File Structures Chapter 11. Hashing Data and File Structures Chapter 11 Hashing 1 Motivation Sequential Searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file. B-Trees improve

More information

Hash[ string key ] ==> integer value

Hash[ string key ] ==> integer value Hashing 1 Overview Hash[ string key ] ==> integer value Hash Table Data Structure : Use-case To support insertion, deletion and search in average-case constant time Assumption: Order of elements irrelevant

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing

More information

The dictionary problem

The dictionary problem 6 Hashing The dictionary problem Different approaches to the dictionary problem: previously: Structuring the set of currently stored keys: lists, trees, graphs,... structuring the complete universe of

More information

Some Practice Problems on Hardware, File Organization and Indexing

Some Practice Problems on Hardware, File Organization and Indexing Some Practice Problems on Hardware, File Organization and Indexing Multiple Choice State if the following statements are true or false. 1. On average, repeated random IO s are as efficient as repeated

More information

UNIT 6A Organizing Data: Lists. Last Two Weeks

UNIT 6A Organizing Data: Lists. Last Two Weeks UNIT 6A Organizing Data: Lists 1 Last Two Weeks Algorithms: Searching and sorting Problem solving technique: Recursion Asymptotic worst case analysis using the big O notation 2 1 This Week Observe: The

More information

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis File Organizations and Indexing Review: Memory, Disks, & Files Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears, Roebuck, and Co.,

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks 11-785 / Fall 2018 / Recitation 7 Raphaël Olivier Recap : RNNs are magic They have infinite memory They handle all kinds of series They re the basis of recent NLP : Translation,

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

An empirical study of smoothing techniques for language modeling

An empirical study of smoothing techniques for language modeling Computer Speech and Language (1999) 13, 359 394 Article No. csla.1999.128 Available online at http://www.idealibrary.com on An empirical study of smoothing techniques for language modeling Stanley F. Chen

More information

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course Today s Topics Review Uses and motivations of hash tables Major concerns with hash tables Properties Hash function Hash

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University

Hashing. CptS 223 Advanced Data Structures. Larry Holder School of Electrical Engineering and Computer Science Washington State University Hashing CptS 223 Advanced Data Structures Larry Holder School of Electrical Engineering and Computer Science Washington State University 1 Overview Hashing Technique supporting insertion, deletion and

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification

More information

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40 Lecture 16 Hashing Hash table and hash function design Hash functions for integers and strings Collision resolution strategies: linear probing, double hashing, random hashing, separate chaining Hash table

More information

Models for Document & Query Representation. Ziawasch Abedjan

Models for Document & Query Representation. Ziawasch Abedjan Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview

More information

Learning Data Systems Components

Learning Data Systems Components Work partially done at Learning Data Systems Components Tim Kraska [Disclaimer: I am NOT talking on behalf of Google] Comments on Social Media Joins Sorting HashMaps Tree Bloom Filter

More information

Graphs. Data Structures and Algorithms CSE 373 SU 18 BEN JONES 1

Graphs. Data Structures and Algorithms CSE 373 SU 18 BEN JONES 1 Graphs Data Structures and Algorithms CSE 373 SU 18 BEN JONES 1 Warmup Discuss with your neighbors: Come up with as many kinds of relational data as you can (data that can be represented with a graph).

More information

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible.

Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. Please note that some of the resources used in this assignment require a Stanford Network Account and therefore may not be accessible. CS 224N / Ling 237 Programming Assignment 1: Language Modeling Due

More information

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo

More information

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations

More information

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University

Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University Data Structure Lecture#22: Searching 3 (Chapter 9) U Kang Seoul National University U Kang 1 In This Lecture Motivation of collision resolution policy Open hashing for collision resolution Closed hashing

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information

File Structures and Indexing

File Structures and Indexing File Structures and Indexing CPS352: Database Systems Simon Miner Gordon College Last Revised: 10/11/12 Agenda Check-in Database File Structures Indexing Database Design Tips Check-in Database File Structures

More information

CS698F Advanced Data Management. Instructor: Medha Atre. Aug 11, 2017 CS698F Adv Data Mgmt 1

CS698F Advanced Data Management. Instructor: Medha Atre. Aug 11, 2017 CS698F Adv Data Mgmt 1 CS698F Advanced Data Management Instructor: Medha Atre Aug 11, 2017 CS698F Adv Data Mgmt 1 Recap Query optimization components. Relational algebra rules. How to rewrite queries with relational algebra

More information

COS 318: Operating Systems. File Systems. Topics. Evolved Data Center Storage Hierarchy. Traditional Data Center Storage Hierarchy

COS 318: Operating Systems. File Systems. Topics. Evolved Data Center Storage Hierarchy. Traditional Data Center Storage Hierarchy Topics COS 318: Operating Systems File Systems hierarchy File system abstraction File system operations File system protection 2 Traditional Data Center Hierarchy Evolved Data Center Hierarchy Clients

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

1. Attempt any three of the following: 15

1. Attempt any three of the following: 15 (Time: 2½ hours) Total Marks: 75 N. B.: (1) All questions are compulsory. (2) Make suitable assumptions wherever necessary and state the assumptions made. (3) Answers to the same question must be written

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

File Organization and Storage Structures

File Organization and Storage Structures File Organization and Storage Structures o Storage of data File Organization and Storage Structures Primary Storage = Main Memory Fast Volatile Expensive Secondary Storage = Files in disks or tapes Non-Volatile

More information

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt

More information