Search Engines. Gertjan van Noord. September 17, 2018

Similar documents
Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Text Technologies for Data Science INFR Indexing (2) Instructor: Walid Magdy

Dictionaries and Tolerant retrieval

Information Retrieval

Recap of last time CS276A Information Retrieval

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

Information Retrieval

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Indexing and Searching

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Indexing and Searching

Information Retrieval

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

CSCI 104 Tries. Mark Redekopp David Kempe

Chapter 11: Indexing and Hashing

Data Structures and Methods. Johan Bollen Old Dominion University Department of Computer Science

Lecture Notes on Tries

Overview of Storage and Indexing

Lecture 3: Phrasal queries and wildcards

Indexing and Searching

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

Suffix-based text indices, construction algorithms, and applications.

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Birkbeck (University of London)

Predecessor. Predecessor Problem van Emde Boas Tries. Philip Bille

Boolean Queries. Keywords combined with Boolean operators:

Chapter 12: Indexing and Hashing. Basic Concepts

Index extensions. Manning, Raghavan and Schütze, Chapter 2. Daniël de Kok

Chapter 12: Indexing and Hashing

4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Lecture Notes on Tries

Predecessor. Predecessor. Predecessors. Predecessors. Predecessor Problem van Emde Boas Tries. Predecessor Problem van Emde Boas Tries.

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Information Retrieval

Introduction to Parallel Computing Errata

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Information Retrieval

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

More B-trees, Hash Tables, etc. CS157B Chris Pollett Feb 21, 2005.

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Non-word Error Detection and Correction

Physical Level of Databases: B+-Trees


Information Retrieval 6. Index compression

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

x-fast and y-fast Tries

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

11/5/13 Comp 555 Fall

Solutions to Problem Set 1

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

ECE 122. Engineering Problem Solving Using Java

SUMMARY OF DATABASE STORAGE AND QUERYING

x-fast and y-fast Tries

CSCI3381-Cryptography

Dictionaries. Priority Queues

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Trees in java.util. A set is an object that stores unique elements In Java, two implementations are available:

Introduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

A Modern spell(1) Abhinav Upadhyay EuroBSDCon 2017, Paris

Introduction p. 1 Pseudocode p. 2 Algorithm Header p. 2 Purpose, Conditions, and Return p. 3 Statement Numbers p. 4 Variables p. 4 Algorithm Analysis

Information Retrieval CS-E credits

Information Retrieval

R16 SET - 1 '' ''' '' ''' Code No: R

Predecessor Data Structures. Philip Bille

11/5/09 Comp 590/Comp Fall

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Chapter 17 Indexing Structures for Files and Physical Database Design

Lecture Notes on Tries

IN4325 Indexing and query processing. Claudia Hauff (WIS, TU Delft)

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

CISC-235* Test #3 March 19, 2018

CS347. Lecture 2 April 9, Prabhakar Raghavan

Data Structures and Algorithms 2018

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

MITOCW watch?v=ninwepprkdq

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Suffix Trees and Arrays

CS61B Fall 2015 Guerrilla Section 3 Worksheet. 8 November 2015

MARKING KEY The University of British Columbia MARKING KEY Computer Science 260 Midterm #2 Examination 12:30 noon, Thursday, March 15, 2012

Outline for Today. How can we speed up operations that work on integer data? A simple data structure for ordered dictionaries.

Table of Contents. Chapter 1: Introduction to Data Structures... 1

Index-assisted approximate matching

CMSC424: Database Design. Instructor: Amol Deshpande

Information Retrieval

Outline for Today. How can we speed up operations that work on integer data? A simple data structure for ordered dictionaries.

Information Retrieval

Kathleen Durant PhD Northeastern University CS Indexes

Transcription:

Search Engines Gertjan van Noord September 17, 2018

About the course Information about the course is available from: http://www.let.rug.nl/vannoord/college/zoekmachines/

Last week Normalization (case, diacritics, stemming, decompounding,... ) Posting List intersection with Skip Pointers Phrase Queries Posting List with positions

Promised: decompounding is harder than you think omroepers paperassen rotspartij zonnestroom plantenteelt uitslover kredietverstrekkers

Promised: decompounding is harder than you think omroepers paperassen rotspartij zonnestroom plantenteelt uitslover kredietverstrekkers om roe pers paper assen rot spar tij zon nest room plan tent eelt uit s lover krediet verstrek kers

Book Exercises

This week: Tolerant Retrieval Wildcard queries Spell correction Alternative indexes Finding the most similar terms

Wildcard queries: * mon*: find documents with words that starts with mon. *mon: find documents with words that end with mon. Harder. mo*n: find documents with words that start with mo and end with n. harder. Even m*o*n: Yet harder.

Wildcard queries Step 1: find all terms that fall within the wildcard definition Step 2: find all documents containing any of these terms

Wildcard queries Step 1: find all terms that fall within the wildcard definition B-trees Permuterm index K-gram index Step 2: find all documents containing any of these terms

Dictionary structures Hash. Very efficient lookup and construction, but a hash cannot be used to find terms that are close to the key. Python dictionaries are implemented by hashes. Binary tree, B-tree, Tries. Data-structures in which data is kept sorted (and balanced). Fairly efficient search, but more costly to construct. Words with same suffix are close together in the result, and therefore these structures can potentially be used for tolerant retrieval.

Hash

Binary tree

B-tree Extension of binary tree in which the tree remains balanced

Trie

Wildcard queries: * mon*: Easy with B-tree, easy with trie. *mon: Maintain additional B-tree or trie for all words in reverse mo*n: Intersect mo* and *n. Use reverse tree for *n. m*o*n:??

Wildcard queries: * mon*: Easy with B-tree, easy with trie. *mon: Maintain additional B-tree or trie for all words in reverse mo*n: Intersect mo* and *n. Use reverse tree for *n. m*o*n:?? Use permuterm index or K-gram index

K-gram index K-gram: group of K consecutive items. Here: characters. For example, if K=3, the K-gram index has keys of three consecutive characters. The key points to all terms which contain that sequence of three characters. Index for dictionary lookup, not for document retrieval. In a k-gram index, a key points to all relevant search terms.

Split words in K-grams, K=3 kitchen \/ $kitchen$ \/ $ki kit itc tch che hen en$

K-gram index, K=3 In a k-gram index, a key points to all relevant terms. $ki en$ che ink itt kit ==> {kinkiten kitchen kitten} ==> {kinkiten kitchen kitten kzen} ==> {bitch kitchen witch} ==> {kinkiten kinky} ==> {bitter kitten} ==> {kinkiten kitchen kitten}

K-gram index, K=3 In a k-gram index, a key points to all relevant terms. $ki en$ che ink itt kit ==> {kinkiten kitchen kitten} ==> {kinkiten kitchen kitten kzen} ==> {bitch kitchen witch} ==> {kinkiten kinky} ==> {bitter kitten} ==> {kinkiten kitchen kitten} The terms are sorted (why?)

K-gram index for wildcard queries Initial Query: kit*en Mapped to: $kit*en$ Search in K-gram index: $ki AND kit AND en$

K-gram index for wildcard queries Initial Query: kit*en Mapped to: $kit*en$ Search in K-gram index: $ki AND kit AND en$ Result: kinkiten kitchen kitten Postprocessing required: kinkiten

K-gram index for wildcard queries Initial Query: kit*en Mapped to: $kit*en$ Search in K-gram index: $ki AND kit AND en$ Result: kinkiten kitchen kitten Postprocessing required: kinkiten The remaining terms are used in OR query: kitchen OR kitten

Query processing What to do for this query: se*ate AND fil*er Expand se*ate to OR-query, e.g., selfhate OR seagate Expand fil*er to OR-query, e.g., filter OR filler Combine into ((selfhate OR seagate) AND (filter OR filler))

Spell Correction for Query Terms If a query term is not present in the term index (or if it is very rare)... Find similar terms Calculate similarity to the query term Use most similar one(s) Use most frequent one(s)

Spell Correction for Query Terms Find similar terms: K-gram index Calculate similarity to the query term: Jaccard, Levenshtein Use most similar one(s) Use most frequent one(s)

Spell Correction for Query Terms Find similar terms: K-gram index For instance: a term t 1 is similar to t 2 if one of the 3-grams of t 1 and t 2 are identical. For unknown term t 1, collect all of the terms in the 3-gram index of all 3-grams. Lots of candidates, only use good ones?

Spell Correction for Query Terms Query: brook $bro broek, brok, brommen, brons roo roomijs,vuurrood,brood,rook ook wierookstaafjes,stookolie,brood,rook ok$ werknemersblok,varkenshok,rook

Spell Correction for Query Terms Query: brook $bro broek, brok, brommen, brons roo roomijs,vuurrood,brood,rook ook wierookstaafjes,stookolie,brood,rook ok$ werknemersblok,varkenshok,rook From all those, only select the ones that are close to the original term

Spell Correction for Query Terms Calculate similarity to the query term Jaccard Jaccard coefficient: A B A B A: trigrams in term t 1 B: trigrams in term t 2

Jacard A B A B A: brook: $br,bro,roo,ook,ok$ B: rook: $ro,roo,ook,ok$

Jacard A B A B A: brook: $br,bro,roo,ook,ok$ B: rook: $ro,roo,ook,ok$ A B: roo,ook,ok$ A B: $br,bro,roo,ook,ok$,$ro

Jacard A B A B A: brook: $br,bro,roo,ook,ok$ B: rook: $ro,roo,ook,ok$ A B: roo,ook,ok$ A B: $br,bro,roo,ook,ok$,$ro Jaccard: 3/6 = 0.5

More precise Minimum Edit Distance Levenshtein Distance

Levenshtein Distance Distance between A and B: Minimum number of insertions, deletions or substitutions to map A to B

Levenshtein Distance Distance between A and B: Minimum number of insertions, deletions or substitutions to map A to B A: brook B: rook

Levenshtein Distance Distance between A and B: Minimum number of insertions, deletions or substitutions to map A to B A: brook B: rook Distance: 1

Levenshtein distance bakker brak otter boter bloed bode ondersteboven binnenstebuiten

Efficient Algorithm try out all possibilities? No. First compute Levenshtein distance for all prefixes Dynamic programming Suppose we need to compute distance for: ondersteboven,binnenstebuiten and we are given the following: dist(onderstebove,binnenstebuite) = 7 dist(onderstebove,binnenstebuiten) = 8 dist(ondersteboven,binnenstebuite) = 8

Efficient Algorithm x and y are strings a and b are symbols Suppose we need to compute distance for dist(xa,yb) and we have dist(x,y) dist(xa,y) dist(x,yb)

Efficient Algorithm? dist(xa,yb) and we have dist(x,y) dist(xa,y) dist(x,yb)

Efficient Algorithm cost(a,b): 0 if a==b; 1 otherwise There are three ways to construct xa,yb: dist(x,y) + cost(a,b) (substitution) dist(xa,y) + 1 (insdel) dist(x,yb) + 1 (insdel)

Efficient Algorithm cost(a,b): 0 if a==b; 1 otherwise There are three ways to construct xa,yb: dist(x,y) + cost(a,b) (substitution) dist(xa,y) + 1 (insdel) dist(x,yb) + 1 (insdel) Take the minimum

# b l o e d # 0 1 2 3 4 5 b o d e Efficient Algorithm: matrix

# b l o e d # 0 1 2 3 4 5 b 1 o d e Efficient Algorithm: matrix

# b l o e d # 0 1 2 3 4 5 b 1 0 1 2 3 4 o 2 1 1 1 2 3 d 3 2 2 2 2 2 e 4 3 3 3 2 3 Efficient Algorithm: matrix

Efficient algorithm: matrix Each cell in the matrix represents the distance between the corresponding prefixes The final result, therefore, can be found in... Other cost functions can be possible too E.g., substitutions for characters that are pronounced similarly could be given lower cost Sometimes, other basic edit operations can be considered (e.g. transposition)