Introduction to Information Retrieval

Similar documents
Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Information Retrieval

Information Retrieval

Information Retrieval

CS105 Introduction to Information Retrieval

Information Retrieval

Search: the beginning. Nisheeth

Information Retrieval

Advanced Retrieval Information Analysis Boolean Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

CS60092: Informa0on Retrieval. Sourangshu Bha<acharya

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

Introduction to Information Retrieval

Information Retrieval

Part 2: Boolean Retrieval Francesco Ricci

CS 572: Information Retrieval. Lecture 2: Hello World! (of Text Search)

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Information Retrieval

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Introduction to Information Retrieval

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Information Retrieval and Text Mining

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval

Classic IR Models 5/6/2012 1

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

1Boolean retrieval. information retrieval. term search is quite ambiguous, but in context we use the two synonymously.

Informa(on Retrieval

Information Retrieval and Organisation

Introduction to Information Retrieval

Lecture 1: Introduction and the Boolean Model

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Web Information Retrieval. Lecture 2 Tokenization, Normalization, Speedup, Phrase Queries

Introduction to Information Retrieval IIR 1: Boolean Retrieval

Preliminary draft (c)2006 Cambridge UP

Lecture 1: Introduction and Overview

Information Retrieval. Chap 8. Inverted Files

PV211: Introduction to Information Retrieval

Web Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti

Data-analysis and Retrieval Boolean retrieval, posting lists and dictionaries

Text Retrieval and Web Search IIR 1: Boolean Retrieval

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval

CS347. Lecture 2 April 9, Prabhakar Raghavan

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Recap: lecture 2 CS276A Information Retrieval

A brief introduction to Information Retrieval

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Introduction to Information Retrieval

Models for Document & Query Representation. Ziawasch Abedjan

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

60-538: Information Retrieval

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

CSCI 5417 Information Retrieval Systems Jim Martin!

Efficient Implementation of Postings Lists

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

Information Retrieval

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval

Information Retrieval Tutorial 1: Boolean Retrieval

F INTRODUCTION TO WEB SEARCH AND MINING. Kenny Q. Zhu Dept. of Computer Science Shanghai Jiao Tong University

Information Retrieval

Index Construction 1

COSC572 GUEST LECTURE - PROF. GRACE HUI YANG INTRODUCTION TO INFORMATION RETRIEVAL NOV 2, 2016

index construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap

Course structure & admin. CS276A Text Information Retrieval, Mining, and Exploitation. Dictionary and postings files: a fast, compact inverted index

Text Pre-processing and Faster Query Processing

n Tuesday office hours changed: n 2-3pm n Homework 1 due Tuesday n Assignment 1 n Due next Friday n Can work with a partner

Introduction to Computational Advertising. MS&E 239 Stanford University Autumn 2010 Instructors: Andrei Broder and Vanja Josifovski

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Corso di Biblioteche Digitali

More on indexing and text operations CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Recap of the previous lecture. Recall the basic indexing pipeline. Plan for this lecture. Parsing a document. Introduction to Information Retrieval

Index Construction. Slides by Manning, Raghavan, Schutze

Information Retrieval and Organisation

2018 EE448, Big Data Mining, Lecture 8. Search Engines. Weinan Zhang Shanghai Jiao Tong University

Information Retrieval CS-E credits

Introduction to. CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan. Lecture 4: Index Construction

Information Retrieval and Web Search

Information Retrieval

Informa(on Retrieval

Information Retrieval

- Content-based Recommendation -

An Introduction to Information Retrieval. Draft of April 15, Preliminary draft (c)2007 Cambridge UP

Information Retrieval. Danushka Bollegala

Introduction to Information Retrieval

FRONT CODING. Front-coding: 8automat*a1 e2 ic3 ion. Extra length beyond automat. Encodes automat. Begins to resemble general string compression.

Transcription:

Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu www.jarrar.info Jarrar 2014 1

Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2011/11/artificial-intelligence-fall-2011.html Jarrar 2014 2

Outline Part 1: Information Retrieval Basics Part 2: Term-document incidence matrices Part 3: Inverted Index Part 4: Query processing with inverted index Part 5: Phrase queries and positional indexes Keywords: Natural Language Processing, Information Retrieval, Precision, Recall, Inverted Index, Positional Indexes, Query processing, Merge Algorithm, B iwords, P hrase queries اللسانيات احالسوبية, استرجاع املعلومات, الدقة, استعالم, تطبيقات لغوية, املعاجلة اآللية للغات الطبيعية Acknowledgement: This lecture is largely based on Chris Manning online course on NLP, which can be accessed at http://www.youtube.com/watch?v=s3kkluba3b0, [1] Jarrar 2014 3

Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). These days we frequently think first of web search, but there are many other cases: E-mail search Searching your laptop Corporate knowledge bases Legal information retrieval Jarrar 2014 4

Unstructured (text) vs. structured (database) data in the mid-nineties Jarrar 2014 5

Unstructured (text) vs. structured (database) data later Jarrar 2014 6

Basic Assumptions of Information Retrieval Collection: A set of documents Assume it is a static collection (but in other scenarios we may need to add and delete documents). Goal: Retrieve documents with information that is relevant to the user s information need and helps the user complete a task. Jarrar 2014 7

Classic Search Model User task Info need Misconception? Misformulation? Get rid of mice in a politically correct way Info about removing mice without killing them Query how trap mice alive Search Search engine Query refinement Results Collection Jarrar 2014 8

How good are the retrieved docs? Precision : Fraction of retrieved docs that are relevant to the user s information.(نسبة الصحيح من بين ما تم استرجاعة) need Recall : Fraction of relevant docs in collection that are retrieved نسبة الصحيح المسترجع من بين جميع الصحيح 11 14 8 4 In this figure the relevant items are to the left of the straight line while the retrieved items are within the oval. The red regions represent errors. On the left these are the relevant items not retrieved (false negatives), while on the right they are the retrieved items that are not relevant (false positives). P= 8/12 = 66%, R=8/14 = 57% Are these measures are always useful as in case of huge collections? Ranking becomes more important sometimes. Jarrar 2014 9

Introduction to Information Retrieval Part 1: Information Retrieval Basics Part 2: Term-document incidence matrices Part 3: Inverted Index Part 4: Query processing with inverted index Part 5: Phrase queries and positional indexes Jarrar 2014 10

Unstructured Data in 1620 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare s plays for Brutus and Caesar, then strip out lines containing Calpurnia? Why is that not the answer? Slow (for large corpora) NOT Calpurnia is non-trivial Other operations (e.g., find the word Romans near countrymen) not feasible Ranked retrieval (best documents to return) Jarrar 2014 11

Term-Document Incidence Matrices Brutus AND Caesar BUT NOT Calpurnia Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 1 if play contains word, 0 otherwise Jarrar 2014 12

Incidence Vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND. 110100 AND 110111 AND 101111 = 100100 Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 Jarrar 2014 13

Answers to Query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar I was killed i the Capitol; Brutus killed me. Jarrar 2014 14

With Bigger Collections! Consider N = 1 million documents, each with about 1000 words. Avg 6 bytes/word including spaces/punctuation = 6GB of data in the documents. Say there are M = 500K distinct terms among these. Jarrar 2014 15

We Can t Build this Matrix! 500K x 1M matrix has half-a-trillion 0 s and 1 s. But it has no more than one billion 1 s. ( =1000 words x 1 M doc.) matrix is extremely sparse. What s a better representation? We only record the 1 positions. Jarrar 2014 16

Introduction to Information Retrieval Part 1: Information Retrieval Basics Part 2: Term-document incidence matrices Part 3: Inverted Index Part 4: Query processing with inverted index Part 5: Phrase queries and positional indexes Jarrar 2014 17

What is an Inverted Index? For each term t, we must store a list of all documents that contain t. Identify each doc by a docid, a document serial number Can we used fixed-size arrays for this? Brutus 1 2 4 11 31 45 173174 Caesar 1 2 4 5 6 16 57 132 Calpurnia 2 31 54101 What happens if the word Caesar is added to document 14? Jarrar 2014 18

Inverted index We need variable-size postings lists On disk, a continuous run of postings is normal and best In memory, can use linked lists or variable length arrays Some tradeoffs in size/ease of insertion Posting Brutus 1 2 4 11 31 45 173174 Caesar 1 2 4 5 6 16 57 132 Calpurnia 2 31 54101 Dictionary Postings Sorted by docid (more later on why). Jarrar 2014 19

Inverted Index Construction Documents to be indexed Friends, Romans, countrymen. Tokenizer Token stream Friends Romans Countrymen Linguistic modules Modified tokens friend roman countryman Indexer friend 2 4 Inverted index roman countryman 1 2 13 16 Jarrar 2014 20

Initial Stages of Text Processing Tokenization Cut character sequence into word tokens (=what is a word!) Deal with John s, a state-of-the-art solution Normalization Map text and query term to same form You want U.S.A. and USA to match Stemming We may wish have different forms of a root to match authorize, authorization Stop words We may omit very common words (or not) the, a, to, of, over, between, his, him Jarrar 2014 21

Indexer Steps: Token Sequence Sequence of (Modified token, Document ID) pairs. Doc 1 I did enact Julius Caesar I was killed i the Capitol; Brutus killed me. Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Jarrar 2014 22

Indexer Steps: Sort Sort by terms And then docid Core indexing step Jarrar 2014 23

Indexer Steps: Dictionary & Postings Multiple term entries in a single document are merged. Split into Dictionary and Postings Doc. frequency information is added. Why frequency? Jarrar 2014 24

Where do we pay in storage? Lists of docids Terms and counts Pointers IR system implementation How do we index efficiently? How much storage do we need? Jarrar 2014 25

Introduction to Information Retrieval Part 1: Information Retrieval Basics Part 2: Term-document incidence matrices Part 3: Inverted Index Part 4: Query processing with inverted index Part 5: Phrase queries and positional indexes Jarrar 2014 26

Query processing: AND Consider processing the query: Brutus AND Caesar Locate Brutus in the Dictionary; Retrieve its postings. Locate Caesar in the Dictionary; Retrieve its postings. Merge the two postings (intersect the document sets): 2 4 8 16 32 64 1 2 3 5 8 13 21 128 34 Brutus Caesar Jarrar 2014 27

The Merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 8 2 4 8 16 32 64 1 2 3 5 8 13 21 128 34 Brutus Caesar If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docid. Jarrar 2014 28

Intersecting two postings lists (a merge algorithm) Jarrar 2014 29

Introduction to Information Retrieval Part 1: Information Retrieval Basics Part 2: Term-document incidence matrices Part 3: Inverted Index Part 4: Query processing with inverted index Part 5: Phrase queries and positional indexes Jarrar 2014 30

Phrase Queries We want to be able to answer queries such as stanford university as a phrase Thus the sentence I went to university at Stanford is not a match. The concept of phrase queries has proven easily understood by users; one of the few advanced search ideas that works For this, it no longer suffices to store only <term : docs> entries Jarrar 2014 31

First Attempt: Biword Indexes Index every consecutive pair of terms in the text as a phrase For example the text Friends, Romans, Countrymen would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate. Jarrar 2014 32

Longer Phrase Queries Longer phrases can be processed by breaking them down stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. Can have false positives! Jarrar 2014 33

Issues for Biword Indexes False positives, as noted before Index blowup due to bigger dictionary Infeasible for more than biwords, big even for them Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy This is not practical/standard solution! Jarrar 2014 34

Solution 2: Positional Indexes In the postings, store, for each term the position(s) in which tokens of it appear: <term, number of docs containing term; doc1: position1, position2 ; doc2: position1, position2 ; etc.> Jarrar 2014 35

Example: Positional index <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, > Which of docs 1,2,4,5 could contain to be or not to be? For phrase queries, we use a merge algorithm recursively at the document level. But we now need to deal with more than just equality. Example: if we look for Birzeit University then if Birzeit in 87 then University should be in 88. Jarrar 2014 36

Processing Phrase Queries Extract inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with to be or not to be. to: be: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191;... 1:17,19; 4:17,191,291,430,434; 5:14,19,101;... Same general method for proximity searches Jarrar 2014 37

Positional Index Size A positional index expands postings storage substantially Even though indices can be compressed. Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries whether used explicitly or implicitly in a ranking retrieval system. Jarrar 2014 38

Rules of thumb A positional index is 2 4 as large as a non-positional index Positional index size 35 50% of volume of original text Caveat: all of this holds for English-like languages Jarrar 2014 39

Combination Schemes (both Solutions) These two approaches can be profitably combined For particular phrases ( Michael Jackson, Britney Spears ) it is inefficient to keep on merging positional postings lists Even more so for phrases like The Who Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme A typical web query mixture was executed in ¼ of the time of using just a positional index It required 26% more space than having a positional index alone Jarrar 2014 40

Project (Search Engine) Building an Arabic Search Engine based on a positional inverted index. A corpus of at least 10 documents containing at least 10000 words should be used to test this engine. The engine should support single, multiple word, as phrase queries. The inverted index should take into account Arabic stemming and stop words. The results page will list the titles of all relevant documents in a clickable manner. This project is a continuation of the previous project, thus students are expected to also allow users to auto complete their search queries. Deadline: 9/10/2014 Jarrar 2014 41

References [1] Dan Jurafsky: From Languages to Information notes http://web.stanford.edu/class/cs124 Jarrar 2014 42