Information Retrieval and Organisation

Similar documents
Introduction to Information Retrieval

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Introduction to Information Retrieval

Information Retrieval and Text Mining

Introduction to Information Retrieval

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

CS105 Introduction to Information Retrieval

Information Retrieval

Information Retrieval

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Introduction to Information Retrieval

Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Advanced Retrieval Information Analysis Boolean Retrieval

Lecture 1: Introduction and Overview

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Informa(on Retrieval

Introduction to Information Retrieval IIR 1: Boolean Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Search: the beginning. Nisheeth

Information Retrieval

Information Retrieval

Part 2: Boolean Retrieval Francesco Ricci

CS 572: Information Retrieval. Lecture 2: Hello World! (of Text Search)

PV211: Introduction to Information Retrieval

CS60092: Informa0on Retrieval. Sourangshu Bha<acharya

Preliminary draft (c)2006 Cambridge UP

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Information Retrieval

Introduction to Information Retrieval

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

Information Retrieval

Classic IR Models 5/6/2012 1

1Boolean retrieval. information retrieval. term search is quite ambiguous, but in context we use the two synonymously.

Text Retrieval and Web Search IIR 1: Boolean Retrieval

Lecture 1: Introduction and the Boolean Model

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

Information Retrieval

Information Retrieval

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Web Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti

Models for Document & Query Representation. Ziawasch Abedjan

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval

A brief introduction to Information Retrieval

Information Retrieval

Information Retrieval Tutorial 1: Boolean Retrieval

F INTRODUCTION TO WEB SEARCH AND MINING. Kenny Q. Zhu Dept. of Computer Science Shanghai Jiao Tong University

Vector Space Models. Jesse Anderton

Introduction to Information Retrieval

PV211: Introduction to Information Retrieval

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Informa(on Retrieval. Administra*ve. Sta*s*cal MT Overview. Problems for Sta*s*cal MT

Information Retrieval. Chap 8. Inverted Files

Introduction to Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

This lecture. Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis.

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

Informa(on Retrieval

An Introduction to Information Retrieval. Draft of March 1, Preliminary draft (c)2007 Cambridge UP

An Introduction to Information Retrieval. Draft of March 5, Preliminary draft (c)2007 Cambridge UP

Data Modelling and Multimedia Databases M

Recap: lecture 2 CS276A Information Retrieval

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Information Retrieval and Organisation

Information retrieval systems

COSC572 GUEST LECTURE - PROF. GRACE HUI YANG INTRODUCTION TO INFORMATION RETRIEVAL NOV 2, 2016

An Introduction to Information Retrieval. Draft of April 15, Preliminary draft (c)2007 Cambridge UP

Information Retrieval and Organisation

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval and Organisation

CS54701: Information Retrieval

60-538: Information Retrieval

The Web document collection

Efficient Implementation of Postings Lists

- Content-based Recommendation -

An Introduction to Information Retrieval. Draft of August 14, Preliminary draft (c)2007 Cambridge UP

A Closeup View. Class Overview CSE 454. Relevance. Retrieval Model Overview. 10/19 IR & Indexing 10/21 Google & Alta.

Corso di Biblioteche Digitali

Information Retrieval. Danushka Bollegala

Query Answering Using Inverted Indexes

CS347. Lecture 2 April 9, Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Information Retrieval

Information Retrieval II

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Lecture 3: Phrasal queries and wildcards

Information Retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Information Technology for Documentary Data Representation

Digital Libraries: Language Technologies

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Transcription:

Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17

IR Chapter 01 Boolean Retrieval

Example IR Problem Let s look at a simple IR problem Suppose you own a copy of Shakespeare s Collected Works You are interested in finding out which plays contain the words Brutus AND Caesar AND NOT Calpurnia Possible solutions: Start reading... Use string-matching algorithm (e.g. grep) scanning files For simple queries on small to modest collections (Shakespeare s Collected Works contain not quite a million words) this is OK.

Limits of Scanning For many purposes, you need more: Process large collections containing billions or trillions of words quickly Allow for more flexible matching operations, e.g. Romans NEAR countrymen Rank answers according to importance (when a large number of documents is returned) Let s look at the performance problem first: Solution: do preprocessing

Term-Document Incidence Matrix Anthony Julius The Hamlet Othello Macbeth... and Caesar Tempest Cleopatra Anthony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0... Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar. Entry is 0 if term doesn t occur. Example: Calpurnia does not occur in The Tempest.

Incidence Vectors So we have a 0/1 vector for each term. To answer the query Brutus AND Caesar AND NOT Calpurnia: Take the vectors for Brutus, Caesar, and Calpurnia Complement the vector of Calpurnia Do a (bitwise) AND on the three vectors 110100 AND 110111 AND 101111 = 100100

Indexing Large Collections Consider N = 10 6 documents, each with about 1000 tokens On average 6 bytes per token, including spaces and punctuation the size of document collection is about 6 GB Assume there are M = 500,000 distinct terms in the collection

Building Incidence Matrix M = 500,000 10 6 = half a trillion 0s and 1s. We would use about 60GB to index 6GB of text, which is clearly very inefficient. But, wait a minute, the matrix has no more than one billion 1s. The matrix is extremely sparse, i.e. 99.8% is filled with 0s. What is a better representations? We only record the 1s.

Inverted Index For each term t, we store a list of IDs of all documents that contain t. Brutus 1 2 4 11 31 45 173 174 Caesar 1 2 4 5 6 16 57 132... Calpurnia 2 31 54 101. }{{}}{{} dictionary postings

Index Construction Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar... Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So... Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so... Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

Index Construction Later on in this module, we ll talk about optimizing inverted indexes: Index construction: how can we create inverted indexes for large collections? How much space do we need for dictionary and index? Index compression: how can we efficiently store and process indexes for large collections? Ranked retrieval: what does the inverted index look like when we want the best answer?

Processing Boolean Queries Consider the conjunctive query: Brutus AND Calpurnia To find all matching documents using inverted index: 1. Locate Brutus in the dictionary 2. Retrieve its postings list from the postings file 3. Locate Calpurnia in the dictionary 4. Retrieve its postings list from the postings file 5. Intersect the two postings lists 6. Return intersection to user

Intersecting Postings Lists Brutus 1 2 4 11 31 45 173 174 Calpurnia 2 31 54 101 Intersection = 2 31 Can be done in linear time if postings lists are sorted

Intersecting Postings Lists

Mapping Operators to Lists The Boolean operators AND, OR, and NOT are evaluated as follows: term1 AND term2: intersection of the lists for term1 and term2 term1 OR term2: union of the lists for term1 and term2 NOT term1: complement of the list for term1

Query Optimization What is the best order for query processing? Consider a query that is an AND of n terms, n > 2 For each of the terms, get its postings list, then AND them together Example query: Brutus AND Calpurnia AND Caesar Brutus 1 2 4 11 31 45 173 174 Calpurnia 2 31 54 101 Caesar 5 31

Query Optimization Simple and effective optimization: Process in the order of increasing frequency Start with the shortest postings list, then keep cutting further In this example, first Caesar, then Calpurnia, then Brutus

Optimized Intersection Algorithm

Commercial Boolean IR: Westlaw Largest commercial legal search service in terms of the number of paying subscribers (www.westlaw.com) Over half a million subscribers performing millions of searches a day over tens of terabytes of text data The service was started in 1975. In 2005, Boolean search (called Terms and Connectors by Westlaw) was still the default, and used by a large percentage of users...... although ranked retrieval has been available since 1992.

Westlaw Example Queries Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company trade secret /s disclos! /s prevent /s employe! Information need: Requirements for disabled people to be able to access a workplace disab! /p access! /s work-site work-place (employment /3 place) Information need: Cases about a host s responsibility for drunk guests host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Westlaw Example Queries /s = within same sentence /p = within same paragraph /n = within n words Space is disjunction, not conjunction (This was the default in search pre-google.) & is AND! is a trailing wildcard query

Summary The Boolean retrieval model can answer any query that is a Boolean expression. Boolean queries are queries that use AND, OR and NOT to join query terms. Views each document as a set of terms. It is precise: document matches condition or not. Primary commercial retrieval tool for 3 decades Many professional searchers (e.g., lawyers) still like Boolean queries You know exactly what you are getting. When are Boolean queries the best way of searching? It depends on: information need, searcher, document collection,...