Information Retrieval

Introduction to Information Retrieval
Lecture 3: Dictionaries and tolerant retrieval

Outline
Dictionaries
Wildcard queries (skip)
Edit distance (skip)
Spelling correction (skip)
Soundex

Inverted index
Our first task is to determine whether each query term exists in the vocabulary. If so, identify the pointer to the corresponding postings.

Dictionaries
The dictionary is the data structure for storing the term vocabulary.
Term vocabulary: the data.
Dictionary: the data structure for storing the term vocabulary.

Dictionary as array of fixed-width entries
For each term, we need to store a couple of items: the document frequency, a pointer to the postings list, ...
Assume for the time being that we can store this information in a fixed-length entry.
Assume that we store these entries in an array.

Dictionary as array of fixed-width entries
Space needed per entry: 20 bytes for the term, 4 bytes for the document frequency, and 4 bytes for the pointer to the postings list, i.e. 28 bytes in total.
For 100,000 terms we need only 2.8 MB.
How do we look up a query term q_i in this array at query time? That is: which data structure do we use to locate the entry in the array where q_i is stored?
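The fixed-width layout can be sketched in Python with the standard `struct` module. This is only an illustration under stated assumptions: the field order, the little-endian encoding, the ASCII-only terms, and the helper names `pack_dictionary` and `read_entry` are not from the slides, and the sample values are made up.

```python
import struct

# Assumed layout: 20-byte term, 4-byte document frequency, 4-byte postings pointer.
# "<" disables alignment padding, so each entry is exactly 28 bytes.
ENTRY = struct.Struct("<20sII")


def pack_dictionary(entries):
    """Pack (term, doc_freq, postings_ptr) tuples into one contiguous byte array."""
    buf = bytearray()
    for term, doc_freq, postings_ptr in entries:
        raw = term.encode("ascii")
        if len(raw) > 20:
            raise ValueError(f"term does not fit in 20 bytes: {term!r}")
        # struct pads the 20s field with NUL bytes on the right.
        buf += ENTRY.pack(raw, doc_freq, postings_ptr)
    return bytes(buf)


def read_entry(buf, i):
    """Read the i-th entry by offset arithmetic: offset = i * entry size."""
    raw, doc_freq, postings_ptr = ENTRY.unpack_from(buf, i * ENTRY.size)
    return raw.rstrip(b"\0").decode("ascii"), doc_freq, postings_ptr


# Illustrative values only.
sample = pack_dictionary([("a", 1000, 0), ("aachen", 65, 4000), ("zulu", 221, 4260)])
print(ENTRY.size, ENTRY.size * 100_000)  # bytes per entry, bytes for 100,000 terms
print(read_entry(sample, 1))
```

Because every entry has the same width, the i-th entry is found by multiplication alone; what the array does not give us is a way to find i from the term, which is the question the slide raises.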

Data structures for looking up a term
Two main classes of data structures: hashes and trees.
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep growing?
What are the relative frequencies with which various keys will be accessed?
How many terms are we likely to have?

Hashes
Each vocabulary term is hashed into an integer.
Collisions can be resolved by auxiliary structures.
At query time: hash the query term, resolve collisions, locate the entry in the fixed-width array.
Pros: Lookup in a hash is faster than lookup in a tree. Lookup time is constant.
Cons: There is no way to find minor variants (resume vs. résumé). There is no prefix search (all terms starting with automat). We need to rehash everything periodically if the vocabulary keeps growing.
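A small sketch of these trade-offs, using Python's built-in `dict` as the hash table. The tiny vocabulary, the integer entry indices, and the function names `lookup` and `prefix_scan` are assumptions made for illustration.

```python
# Term -> index of its entry in the fixed-width array (values are illustrative).
dictionary = {
    "automata": 0,
    "automatic": 1,
    "automation": 2,
    "résumé": 3,
    "zulu": 4,
}


def lookup(term):
    """Exact-match lookup: expected constant time, independent of vocabulary size."""
    return dictionary.get(term)


def prefix_scan(prefix):
    """A hash offers no ordering, so a prefix query has to visit every key."""
    return sorted(t for t in dictionary if t.startswith(prefix))


print(lookup("automatic"))     # exact term: found
print(lookup("resume"))        # minor variant of "résumé": hashes elsewhere, not found
print(prefix_scan("automat"))  # works, but only by scanning the whole vocabulary
```

The exact lookup is fast, but the variant `resume` misses entirely and the prefix query degrades to a linear scan.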

Trees
Trees solve the prefix problem (find all terms starting with automat).
Simplest tree: binary tree, in which each internal node has two children.
Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers, e.g., [2, 4].
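To make the O(log M) bound concrete, here is a minimal sketch that builds a balanced binary search tree from an already sorted vocabulary and counts the nodes visited per lookup. The `Node` class, the helper names, and the synthetic terms are assumptions for illustration; the tree is built once and never rebalanced, so it sidesteps the rebalancing cost mentioned above rather than solving it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    term: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced(sorted_terms, lo=0, hi=None):
    """Build a balanced tree by always choosing the middle term as the root."""
    if hi is None:
        hi = len(sorted_terms)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(
        sorted_terms[mid],
        build_balanced(sorted_terms, lo, mid),
        build_balanced(sorted_terms, mid + 1, hi),
    )


def search(root, term):
    """Return (found, number of nodes visited)."""
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if term == node.term:
            return True, visited
        node = node.left if term < node.term else node.right
    return False, visited


# Synthetic vocabulary of M = 1023 terms, already in sorted order.
terms = [f"t{i:04d}" for i in range(1023)]
root = build_balanced(terms)
print(max(search(root, t)[1] for t in terms))  # worst case over all terms
```

With M = 1023 the tree has 10 levels, so no lookup visits more than 10 nodes, in line with log2(M + 1) = 10.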

[Figure: binary tree]

[Figure: B-tree]

Wildcard queries
mon*: find all docs containing any term beginning with mon.
Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo.
*mon: find all docs containing any term ending with mon.
Maintain an additional tree for terms written backwards. Then retrieve all terms t in the range nom ≤ t < non.
Result: a set of terms that match the wildcard query.
Then retrieve the documents that contain any of these terms.
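The range lookups can be sketched with a sorted list and binary search standing in for the B-tree (an assumption made to keep the example short; a real dictionary would use a tree). For mon* the range is [mon, moo); for *mon the reversed prefix nom gives the range [nom, non) over the reversed terms. The vocabulary and function names are illustrative.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper, where upper is the prefix
    with its last character incremented (e.g. "mon" -> "moo")."""
    if not prefix:
        return list(sorted_terms)
    # Assumes the last character is not the maximum code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


vocabulary = sorted(["demon", "lemon", "monday", "money", "month", "moon", "salmon"])
# Second structure: the same terms written backwards, kept in sorted order.
reversed_vocabulary = sorted(t[::-1] for t in vocabulary)


def trailing_wildcard(prefix):
    """Answer queries of the form prefix*."""
    return prefix_range(vocabulary, prefix)


def leading_wildcard(suffix):
    """Answer queries of the form *suffix via the reversed vocabulary."""
    hits = prefix_range(reversed_vocabulary, suffix[::-1])
    return sorted(t[::-1] for t in hits)


print(trailing_wildcard("mon"))  # terms beginning with "mon"
print(leading_wildcard("mon"))   # terms ending with "mon"
```

Each returned term would then be looked up in the inverted index, and the union of the postings lists gives the matching documents.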

Soundex
Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
Applicable to search on the names of people. Example: chebyshev / tchebyscheff.
The main idea here is to generate, for each term, a phonetic hash, so that similar-sounding terms hash to the same value.
Algorithm:
Turn every token to be indexed into a 4-character reduced form.
Do the same with query terms.
Build and search an index on the reduced forms.

Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to 0 (zero): A, E, I, O, U, H, W, Y.
❸ Change letters to digits as follows: B, F, P, V to 1; C, G, J, K, Q, S, X, Z to 2; D, T to 3; L to 4; M, N to 5; R to 6.
❹ Repeatedly remove one out of each pair of consecutive identical digits.
❺ Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.

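Example: Soundex of HERMAN
Retain H.
ERMAN → 0RM0N (vowels and H, W, Y to 0).
0RM0N → 06505 (letters to digits).
06505 → 06505 (no consecutive identical digits to remove).
06505 → 655 (zeros removed).
Return H655.
Note: HERMANN will generate the same code.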
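A minimal Python sketch of the five Soundex steps as listed on the slide, applied to the letters after the first one, as in the HERMAN example. It is not a reference implementation of any standardized Soundex variant. The function name `soundex` and the decision to silently drop characters outside A-Z are assumptions.

```python
from itertools import groupby

# Steps 2 and 3 combined into one lookup table: letter -> digit.
_CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]:
    for letter in letters:
        _CODES[letter] = digit


def soundex(term):
    """Return the 4-character code: one letter followed by three digits."""
    # Assumption: characters outside A-Z (digits, hyphens, accented letters) are dropped.
    letters = [ch for ch in term.upper() if ch in _CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]           # step 1: retain the first letter
    digits = [_CODES[ch] for ch in rest]            # steps 2 and 3: letters to digits
    collapsed = [d for d, _ in groupby(digits)]     # step 4: collapse runs of equal digits
    kept = [d for d in collapsed if d != "0"]       # step 5: remove zeros ...
    return (first + "".join(kept) + "000")[:4]      # ... pad with zeros, keep 4 positions


print(soundex("HERMAN"), soundex("HERMANN"))
```

Using `groupby` collapses each run of identical digits to a single digit, which has the same effect as repeatedly removing one digit from each consecutive identical pair.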

How useful is Soundex?
Not very useful for information retrieval.
OK for high-recall tasks in other applications (e.g., Interpol).
Zobel and Dart (1996) suggest better alternatives for phonetic matching in IR.