Efficient Data Structures for Massive N-Gram Datasets
|
|
- Clara Flynn
- 5 years ago
- Views:
Transcription
1 Efficient ata Structures for Massive N-Gram atasets Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy The 40-th CM SIGIR Conference on Research and evelopment in Information Retrieval Tokyo, Japan 10/08/2017 1
2 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2
3 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. 2
4 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published 2
5 N-grams - Introduction Strings of N words. N typically ranges from 1 to 5. Extracted from text using a sliding window approach. ooks 6% of the books ever published N number of grams 1 24,359, ,284, ,397,041, ,644,807, ,415,355,596 More than 11 billion grams. 2
6 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. 3
7 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values 3
8 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) 3
9 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. 3
10 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map 3
11 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash + time - space 3
12 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time 3
13 N-grams - Challenge Store massive N-grams datasets in compressed space such that given a pattern, we can return its value efficiently. N-Gram values frequency count (integer) probability/backoff weight (floating point) For backoff-interpolated models, such as Kneser-Ney. Efficient map hash trie + time - space + space - time ctive field of research Many software libraries KenLM [Heafield, WMT 2011] erkeleylm [Pauls and Klein, CL 2011] ExpGram [Watanabe at el., IJCNLP 2009] IRSTLM [Federico et al., CL 2008] RandLM [Talbot and Osborne, CL 2007] SRILM [Stolcke, INTERSPEECH 2002] 3
14 Trie Indexing 4
15 4 C C C C Trie Indexing
16 4 C C C C Trie Indexing
17 4 C C C C Trie Indexing C C C C
18 Trie Indexing C C C hash vocabulary 0 1 C 2 3 C C C C C 4
19 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C
20 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C
21 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C
22 Trie Indexing C C C hash vocabulary 0 1 C C2 3 C C2 0 C C
23 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 C 0 1 C C2 0 C C
24 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C2 0 C C
25 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C 0 1 C C C C
26 Trie Indexing C 0 We 3 need 5 an 5 encoder for integer sequences, supporting fast random ccess. C C hash vocabulary 0 1 C 2 3 Take range-wise prefix sums on gram-i sequences. C Elias-Fano Tries One NextGEQ per level Constant-time random ccess 0 1 C C C C
27 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5
28 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). 5
29 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5
30 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5
31 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5
32 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5
33 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). k = 1 5
34 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution. 5
35 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. k = 1 The number of siblings has a funnel-shaped distribution
36 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution
37 Context-based I Remapping Observation: the number of words following a given context is small. High-level idea: map a word I to the position it takes within its sibling Is (the Is following a context of fixed length k). Millions of unigrams. Height 5: longer contexts. u/n by varying context-length k k = 1 The number of siblings has a funnel-shaped distribution
38 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6
39 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6
40 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6
41 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting 6
42 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6
43 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average you will notice this! 6
44 Experimental nalysis - EF/PEF (R)Trie Test machine Intel Xeon E v3, 2.4 GHz 193 G of RM, Linux 64 bits C++ implementation gcc with the highest optimization setting Context-based I Remapping reduces space by more than 36% on average brings approximately 30% more time you will notice this! will you notice this? 6
45 Experimental nalysis - Overall comparison 7
46 Experimental nalysis - Overall comparison 7
47 Experimental nalysis - Overall comparison 2.3X 2.5X 7
48 Experimental nalysis - Overall comparison 2.3X 2.5X 7
49 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7
50 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X 7
51 Experimental nalysis - Overall comparison X 2X 2X 2X X 2X 2.3X 2.5X 3.5X 5.5X 2.8X 2.7X 2.5X 2.5X 3X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but more than twice smaller. 7
52 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8
53 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8
54 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8
55 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function +60% +65% Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8
56 Experimental nalysis - Perplexity Reversed Elias-Fano Tries Probabilities and backoffs are quantized (binning method) using any number of bits from 2 to 32 Stateful scoring function 2X 4X 2X 2.7X 3X 3.3X 2.5X 2X 4X +25% 3.5X +40% 15X +80% 36X +60% +65% +30% 25X +20% 16X Elias-Fano Tries substantially outperform LL previous solutions in both space and time. s fast as the state-of-the-art (KenLM) but up to 65% more space-efficient. 8
57 9
58 9
59 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility 9
60 On-going work Parallel and scalable estimation of Kneser-Ney language models Python wrapper, installable through pip utility Future work Optimal I-assignment for Elias-Fano? (NP-hard problem) Make queries (especially perplexity) even faster 9
61 Proudly supported by a Student Travel Grant Thanks for your attention, time, patience! ny questions? 10
Efficient Data Structures for Massive N -Gram Datasets
Efficient Data Structures for Massive N -Gram Datasets Giulio Ermanno Pibiri Dept. of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it ABSTRACT The efficient indexing of large and
More informationarxiv: v1 [cs.ir] 25 Jun 2018
Handling Massive N -Gram Datasets Efficiently ariv:806.09447v [cs.ir] 25 Jun 208 GIULIO ERMNNO PIBIRI and ROSSNO VENTURINI, University of Pisa and ISTI-NR, Italy bstract. This paper deals with the two
More informationKenLM: Faster and Smaller Language Model Queries
KenLM: Faster and Smaller Language Model Queries Kenneth heafield@cs.cmu.edu Carnegie Mellon July 30, 2011 kheafield.com/code/kenlm What KenLM Does Answer language model queries using less time and memory.
More informationFaster and Smaller N-Gram Language Models
Faster and Smaller N-Gram Language Models Adam Pauls Dan Klein Computer Science Division University of California, Berkeley {adpauls,klein}@csberkeleyedu Abstract N-gram language models are a major resource
More informationN-gram language models for massively parallel devices
N-gram language models for massively parallel devices Nikolay Bogoychev and Adam Lopez University of Edinburgh Edinburgh, United Kingdom Abstract For many applications, the query speed of N-gram language
More informationNatural Language Processing
Natural Language Processing Language Models Language models are distributions over sentences N gram models are built from local conditional probabilities Language Modeling II Dan Klein UC Berkeley, The
More informationCS 288: Statistical NLP Assignment 1: Language Modeling
CS 288: Statistical NLP Assignment 1: Language Modeling Due September 12, 2014 Collaboration Policy You are allowed to discuss the assignment with other students and collaborate on developing algorithms
More informationN-Gram Language Modelling including Feed-Forward NNs. Kenneth Heafield. University of Edinburgh
N-Gram Language Modelling including Feed-Forward NNs Kenneth Heafield University of Edinburgh History of Language Model History Kenneth Heafield University of Edinburgh 3 p(type Predictive) > p(tyler Predictive)
More informationarxiv: v1 [cs.ir] 29 Apr 2018
Variable-Byte Encoding Is Now Space-Efficient Too Giulio Ermanno Pibiri University of Pisa and ISTI-CNR Pisa, Italy giulio.pibiri@di.unipi.it Rossano Venturini University of Pisa and ISTI-CNR Pisa, Italy
More informationStoring the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval David Guthrie Computer Science Department University of Sheffield D.Guthrie@dcs.shef.ac.uk Mark Hepple Computer Science
More informationn-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 TSUBAKI 2)
n-gram 1, 2 1 1 1 n-gram n-gram A Large-scale N-gram Search System Using Inverted Files Susumu Yata, 1, 2 Kazuhiro Morita, 1 Masao Fuketa 1 and Jun-ichi Aoe 1 Recently, a search system has been proposed
More informationSmoothing. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler
Smoothing BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de November 1, 2016 Last Week Language model: P(Xt = wt X1 = w1,...,xt-1 = wt-1)
More informationAlgorithms for NLP. Language Modeling II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project after today s lecture Get familiar with bit-twiddling
More informationOutline GIZA++ Moses. Demo. Steps Output files. Training pipeline Decoder
GIZA++ and Moses Outline GIZA++ Steps Output files Moses Training pipeline Decoder Demo GIZA++ A statistical machine translation toolkit used to train IBM Models 1-5 (moses only uses output of IBM Model-1)
More informationInverted Index Compression
Inverted Index Compression Giulio Ermanno Pibiri and Rossano Venturini Department of Computer Science, University of Pisa, Italy giulio.pibiri@di.unipi.it and rossano.venturini@unipi.it Abstract The data
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationImproved Address-Calculation Coding of Integer Arrays
Improved Address-Calculation Coding of Integer Arrays Jyrki Katajainen 1,2 Amr Elmasry 3, Jukka Teuhola 4 1 University of Copenhagen 2 Jyrki Katajainen and Company 3 Alexandria University 4 University
More informationNatural Language Processing with Deep Learning CS224N/Ling284
Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Recurrent Neural Networks Christopher Manning and Richard Socher Organization Extra project office hour today after lecture Overview
More informationCS159 - Assignment 2b
CS159 - Assignment 2b Due: Tuesday, Sept. 23 at 2:45pm For the main part of this assignment we will be constructing a number of smoothed versions of a bigram language model and we will be evaluating its
More informationHustle Documentation. Release 0.1. Tim Spurway
Hustle Documentation Release 0.1 Tim Spurway February 26, 2014 Contents 1 Features 3 2 Getting started 5 2.1 Installing Hustle............................................. 5 2.2 Hustle Tutorial..............................................
More informationOptimal Space-time Tradeoffs for Inverted Indexes
Optimal Space-time Tradeoffs for Inverted Indexes Giuseppe Ottaviano 1, Nicola Tonellotto 1, Rossano Venturini 1,2 1 National Research Council of Italy, Pisa, Italy 2 Department of Computer Science, University
More informationSampling Table Configulations for the Hierarchical Poisson-Dirichlet Process
Sampling Table Configulations for the Hierarchical Poisson-Dirichlet Process Changyou Chen,, Lan Du,, Wray Buntine, ANU College of Engineering and Computer Science The Australian National University National
More informationUsing Graphics Processors for High Performance IR Query Processing
Using Graphics Processors for High Performance IR Query Processing Shuai Ding Jinru He Hao Yan Torsten Suel Polytechnic Inst. of NYU Polytechnic Inst. of NYU Polytechnic Inst. of NYU Yahoo! Research Brooklyn,
More informationIndexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton
Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.
More informationLign/CSE 256, Programming Assignment 1: Language Models
Lign/CSE 256, Programming Assignment 1: Language Models 16 January 2008 due 1 Feb 2008 1 Preliminaries First, make sure you can access the course materials. 1 The components are: ˆ code1.zip: the Java
More informationEfficient Diversification of Web Search Results
Efficient Diversification of Web Search Results G. Capannini, F. M. Nardini, R. Perego, and F. Silvestri ISTI-CNR, Pisa, Italy Laboratory Web Search Results Diversification Query: Vinci, what is the user
More informationSpatio-temporal Range Searching Over Compressed Kinetic Sensor Data. Sorelle A. Friedler Google Joint work with David M. Mount
Spatio-temporal Range Searching Over Compressed Kinetic Sensor Data Sorelle A. Friedler Google Joint work with David M. Mount Motivation Kinetic data: data generated by moving objects Sensors collect data
More informationCMPSCI 646, Information Retrieval (Fall 2003)
CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where
More informationPatrick Nguyen, Jianfeng Gao, and Milind Mahajan November Technical Report MSR-TR
MSRLM: a scalable language modeling toolkit Patrick Nguyen, Jianfeng Gao, and Milind Mahajan {panguyen,jfgao,milindm}@microsoft.com November 2007 Technical Report MSR-TR-2007-144 Microsoft Research Microsoft
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationCompressed Indexes for String Searching in Labeled Graphs
Compressed Indexes for String Searching in Labeled Graphs Paolo Ferragina Francesco Piccinno Rossano Venturini Dipartimento di Informatica, University of Pisa {ferragina, piccinno, rossano}@di.unipi.it
More informationBloom Filter and Lossy Dictionary Based Language Models
Bloom Filter and Lossy Dictionary Based Language Models Abby D. Levenberg E H U N I V E R S I T Y T O H F R G E D I N B U Master of Science Cognitive Science and Natural Language Processing School of Informatics
More informationFast and compact Hamming distance index
Fast and compact Hamming distance index Simon Gog Institute of Theoretical Informatics, Karlsruhe Institute of Technology gog@kit.edu Rossano Venturini University of Pisa and Istella Srl rossano.venturini@unipi.it
More informationLecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013
Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest
More informationA DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU
A DEDUPLICATION-INSPIRED FAST DELTA COMPRESSION APPROACH W EN XIA, HONG JIANG, DA N FENG, LEI T I A N, M I N FU, YUKUN Z HOU PRESENTED BY ROMAN SHOR Overview Technics of data reduction in storage systems:
More informationAccelerating Parallel Analysis of Scientific Simulation Data via Zazen
Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation
More informationB. Evaluation and Exploration of Next Generation Systems for Applicability and Performance (Volodymyr Kindratenko, Guochun Shi)
A. Summary - In the area of Evaluation and Exploration of Next Generation Systems for Applicability and Performance, over the period of 01/01/11 through 03/31/11 the NCSA Innovative Systems Lab team investigated
More informationAn Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic
An Oracle White Paper September 2011 Oracle Utilities Meter Data Management 2.0.1 Demonstrates Extreme Performance on Oracle Exadata/Exalogic Introduction New utilities technologies are bringing with them
More informationScalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures
Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures William A. Pearlman Center for Image Processing Research Rensselaer Polytechnic Institute pearlw@ecse.rpi.edu
More informationDistributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition
Karlsruhe Institute of Technology Department of Informatics Institute for Anthropomatics Distributed N-Gram Language Models: Application of Large Models to Automatic Speech Recognition Christian Mandery
More informationHardware Acceleration in Computer Networks. Jan Kořenek Conference IT4Innovations, Ostrava
Hardware Acceleration in Computer Networks Outline Motivation for hardware acceleration Longest prefix matching using FPGA Hardware acceleration of time critical operations Framework and applications Contracted
More informationCSI33 Data Structures
Outline Department of Mathematics and Computer Science Bronx Community College September 6, 2017 Outline Outline 1 Chapter 2: Data Abstraction Outline Chapter 2: Data Abstraction 1 Chapter 2: Data Abstraction
More informationEnhancing Bitmap Indices
Enhancing Bitmap Indices Guadalupe Canahuate The Ohio State University Advisor: Hakan Ferhatosmanoglu Introduction n Bitmap indices: Widely used in data warehouses and read-only domains Implemented in
More informationEfficient Representation of Local Geometry for Large Scale Object Retrieval
Efficient Representation of Local Geometry for Large Scale Object Retrieval Michal Perďoch Ondřej Chum and Jiří Matas Center for Machine Perception Czech Technical University in Prague IEEE Computer Society
More informationCS 206 Introduction to Computer Science II
CS 206 Introduction to Computer Science II 04 / 25 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Balanced Binary Search trees AVL trees / Compression Uses binary trees Balanced
More informationA Retrieval Method for Double Array Structures by Using Byte N-Gram
A Retrieval Method for Double Array Structures by Using Byte N-Gram Masao Fuketa, Kazuhiro Morita, and Jun-Ichi Aoe Abstract Retrieving keywords requires speed and compactness. A trie is one of the data
More informationOpen-Source Speech Recognition for Hand-held and Embedded Devices
PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices David Huggins Daines (dhuggins@cs.cmu.edu) Mohit Kumar (mohitkum@cs.cmu.edu) Arthur Chan (archan@cs.cmu.edu) Alan W Black
More informationAdding Integers. Unit 1 Lesson 6
Unit 1 Lesson 6 Students will be able to: Add integers using rules and number line Key Vocabulary: An integer Number line Rules for Adding Integers There are two rules that you must follow when adding
More informationEfficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction
Efficient Top-k Shortest-Path Distance Queries on Large Networks by Pruned Landmark Labeling with Application to Network Structure Prediction Takuya Akiba U Tokyo Takanori Hayashi U Tokyo Nozomi Nori Kyoto
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationgcc o driver std=c99 -Wall driver.c bigmesa.c
C Programming Simple Array Processing This assignment consists of two parts. The first part focuses on array read accesses and computational logic. The second part focuses on array read/write access and
More informationR16 SET - 1 '' ''' '' ''' Code No: R
1. a) Define Latency time and Transmission time? (2M) b) Define Hash table and Hash function? (2M) c) Explain the Binary Heap Structure Property? (3M) d) List the properties of Red-Black trees? (3M) e)
More informationIndexing Variable Length Substrings for Exact and Approximate Matching
Indexing Variable Length Substrings for Exact and Approximate Matching Gonzalo Navarro 1, and Leena Salmela 2 1 Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl 2 Department of
More informationDigital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay
Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today
More informationOptimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*
Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationA Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms
A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms Charalampos S. Kouzinopoulos and Konstantinos G. Margaritis Parallel and Distributed Processing Laboratory Department
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationMobile Visual Search with Word-HOG Descriptors
Mobile Visual Search with Word-HOG Descriptors Sam S. Tsai, Huizhong Chen, David M. Chen, and Bernd Girod Department of Electrical Engineering, Stanford University, Stanford, CA, 9435 sstsai@alumni.stanford.edu,
More informationCS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018
CS 124/LING 180/LING 280 From Languages to Information Week 2: Group Exercises on Language Modeling Winter 2018 Dan Jurafsky Tuesday, January 23, 2018 1 Part 1: Group Exercise We are interested in building
More informationRecommender Systems New Approaches with Netflix Dataset
Recommender Systems New Approaches with Netflix Dataset Robert Bell Yehuda Koren AT&T Labs ICDM 2007 Presented by Matt Rodriguez Outline Overview of Recommender System Approaches which are Content based
More informationLinked Structures Songs, Games, Movies Part IV. Fall 2013 Carola Wenk
Linked Structures Songs, Games, Movies Part IV Fall 23 Carola Wenk Storing Text We ve been focusing on numbers. What about text? Animal, Bird, Cat, Car, Chase, Camp, Canal We can compare the lexicographic
More informationPre-Computable Multi-Layer Neural Network Language Models
Pre-Computable Multi-Layer Neural Network Language Models Jacob Devlin jdevlin@microsoft.com Chris Quirk chrisq@microsoft.com Arul Menezes arulm@microsoft.com Abstract In the last several years, neural
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationEntity and Knowledge Base-oriented Information Retrieval
Entity and Knowledge Base-oriented Information Retrieval Presenter: Liuqing Li liuqing@vt.edu Digital Library Research Laboratory Virginia Polytechnic Institute and State University Blacksburg, VA 24061
More informationSchema Extraction for Tabular Data on the Web
Schema Extraction for abular ata on the Web Marco delfio anan Samet epartment of Computer Science Center for utomation Research Institute for dvanced Computer Studies University of Maryland College Park,
More informationCMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP. University of Maryland. Wednesday, November 18, 2009
CMSC 723: Computational Linguistics I Session #12 MapReduce and Data Intensive NLP Jimmy Lin and Nitin Madnani University of Maryland Wednesday, November 18, 2009 Three Pillars of Statistical NLP Algorithms
More informationDatenbanksysteme II: Modern Hardware. Stefan Sprenger November 23, 2016
Datenbanksysteme II: Modern Hardware Stefan Sprenger November 23, 2016 Content of this Lecture Introduction to Modern Hardware CPUs, Cache Hierarchy Branch Prediction SIMD NUMA Cache-Sensitive Skip List
More informationDeep Character-Level Click-Through Rate Prediction for Sponsored Search
Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as
More informationA Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions
A Space-Efficient Phrase Table Implementation Using Minimal Perfect Hash Functions Marcin Junczys-Dowmunt Faculty of Mathematics and Computer Science Adam Mickiewicz University ul. Umultowska 87, 61-614
More informationAlgorithms for NLP. Language Modeling II. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Language Modeling II Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Should be able to really start project ager today s lecture Get familiar with bit- twiddling
More informationA Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises
308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an
More informationCS294-1 Assignment 2 Report
CS294-1 Assignment 2 Report Keling Chen and Huasha Zhao February 24, 2012 1 Introduction The goal of this homework is to predict a users numeric rating for a book from the text of the user s review. The
More informationOpenCL Vectorising Features. Andreas Beckmann
Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More information15 July, Huffman Trees. Heaps
1 Huffman Trees The Huffman Code: Huffman algorithm uses a binary tree to compress data. It is called the Huffman code, after David Huffman who discovered d it in 1952. Data compression is important in
More informationAutomatic Semantic Platform- dependent Redesign. Giulio Mori & Fabio Paternò. HIIS Laboratory ISTI-C.N.R.
Automatic Semantic Platform- dependent Redesign Giulio Mori & Fabio Paternò http://giove.isti.cnr.it/ HIIS Laboratory ISTI-C.N.R. Pisa, Italy Multi-Device Interactive Services: Current Practice Manual
More informationThe Adaptive Radix Tree
Department of Informatics, University of Zürich MSc Basismodul The Adaptive Radix Tree Rafael Kallis Matrikelnummer: -708-887 Email: rk@rafaelkallis.com September 8, 08 supervised by Prof. Dr. Michael
More informationFastText. Jon Koss, Abhishek Jindal
FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words
More informationDatabase System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static
More informationPresented by: Dardan Xhymshiti Spring 2016
Presented by: Dardan Xhymshiti Spring 2016 Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications
More informationProgramming and Data Structure Laboratory (CS13002)
Programming and Data Structure Laboratory (CS13002) Dr. Sudeshna Sarkar Dr. Indranil Sengupta Dept. of Computer Science & Engg., IIT Kharagpur 1 Some Rules to be Followed Attendance is mandatory. Regular
More informationAn empirical study of smoothing techniques for language modeling
Computer Speech and Language (1999) 13, 359 394 Article No. csla.1999.128 Available online at http://www.idealibrary.com on An empirical study of smoothing techniques for language modeling Stanley F. Chen
More informationTrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa
TrafficDB: HERE s High Performance Shared-Memory Data Store Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, and Anis Moussa EPL646: Advanced Topics in Databases Christos Hadjistyllis
More information15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2
15110 PRINCIPLES OF COMPUTING SAMPLE EXAM 2 Name Section Directions: Answer each question neatly in the space provided. Please read each question carefully. You have 50 minutes for this exam. No electronic
More informationA Compressed Cache-Oblivious String B-tree
A Compressed Cache-Oblivious String B-tree PAOLO FERRAGINA, Department of Computer Science, University of Pisa ROSSANO VENTURINI, Department of Computer Science, University of Pisa In this paper we study
More informationIterative CKY parsing for Probabilistic Context-Free Grammars
Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST
More informationarxiv: v1 [cs.cl] 16 Aug 2016
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees arxiv:1608.04465v1 [cs.cl] 16 Aug 2016 Ehsan Shareghi, Matthias Petri, Gholamreza Haffari and Trevor Cohn Faculty of
More informationWebSAIL Wikifier at ERD 2014
WebSAIL Wikifier at ERD 2014 Thanapon Noraset, Chandra Sekhar Bhagavatula, Doug Downey Department of Electrical Engineering & Computer Science, Northwestern University {nor.thanapon, csbhagav}@u.northwestern.edu,ddowney@eecs.northwestern.edu
More informationSo you haven t upgraded to MapInfo 64-bit yet?
MapInfo v16 So you haven t upgraded to MapInfo 64-bit yet? This document provides a quick overview of the important features and improvements of the current 64-bit release for those customers who have
More informationInverted List Caching for Topical Index Shards
Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed
More informationSparse Non-negative Matrix Language Modeling
Sparse Non-negative Matrix Language Modeling Joris Pelemans Noam Shazeer Ciprian Chelba joris@pelemans.be noam@google.com ciprianchelba@google.com 1 Outline Motivation Sparse Non-negative Matrix Language
More informationLinux Kernel Hacking Free Course
Linux Kernel Hacking Free Course 3 rd edition G.Grilli, University of me Tor Vergata IRQ DISTRIBUTION IN MULTIPROCESSOR SYSTEMS April 05, 2006 IRQ distribution in multiprocessor systems 1 Contents: What
More informationApplications of Succinct Dynamic Compact Tries to Some String Problems
Applications of Succinct Dynamic Compact Tries to Some String Problems Takuya Takagi 1, Takashi Uemura 2, Shunsuke Inenaga 3, Kunihiko Sadakane 4, and Hiroki Arimura 1 1 IST & School of Engineering, Hokkaido
More informationScalable Trigram Backoff Language Models
Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work
More informationTrees in java.util. A set is an object that stores unique elements In Java, two implementations are available:
Trees in java.util A set is an object that stores unique elements In Java, two implementations are available: The class HashSet implements the set with a hash table and a hash function The class TreeSet,
More informationComputing n-gram Statistics in MapReduce
Computing n-gram Statistics in MapReduce Klaus Berberich (kberberi@mpi-inf.mpg.de) Srikanta Bedathur (bedathur@iiitd.ac.in) n-gram Statistics Statistics about variable-length word sequences (e.g., lord
More informationCOS 226 Algorithms and Data Structures Fall Final Solutions. 10. Remark: this is essentially the same question from the midterm.
COS 226 Algorithms and Data Structures Fall 2011 Final Solutions 1 Analysis of algorithms (a) T (N) = 1 10 N 5/3 When N increases by a factor of 8, the memory usage increases by a factor of 32 Thus, T
More informationProgramming Studio #1 ECE 190
Programming Studio #1 ECE 190 Programming Studio #1 Announcements In Studio Assignment Introduction to Linux Command-Line Operations Recitation Floating Point Representation Binary & Hexadecimal 2 s Complement
More informationSIGIR 2016 Richard Zanibbi, Kenny Davila, Andrew Kane, Frank W. Tompa July 18, 2016
SIGIR 2016 Richard Zanibbi, Kenny Davila, Andrew Kane, Frank W. Tompa July 18, 2016 Mathematical Information Retrieval (MIR) Many mathematical resources are available online, such as: Online databases
More information