Information Retrieval and Web Search

Similar documents
Information Retrieval. Information Retrieval and Web Search

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Modern information retrieval

CS 6320 Natural Language Processing

Information Retrieval and Web Search

Classic IR Models 5/6/2012 1

Information Retrieval. (M&S Ch 15)

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

CS105 Introduction to Information Retrieval

Introduction to Information Retrieval

modern database systems lecture 4 : information retrieval

Information Retrieval. Chap 8. Inverted Files

CSE 494/598 Lecture-2: Information Retrieval. **Content adapted from last year s slides

Information Retrieval and Web Search

Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Information Retrieval

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Modern Information Retrieval

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

Chapter 6: Information Retrieval and Web Search. An introduction

Overview of Information Retrieval and Organization. CSC 575 Intelligent Information Retrieval

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Information Retrieval and Data Mining Part 1 Information Retrieval

Information Retrieval

Information Retrieval

Models for Document & Query Representation. Ziawasch Abedjan

Information Retrieval

Information Retrieval

CS54701: Information Retrieval

Document indexing, similarities and retrieval in large scale text collections

Information Retrieval

Information Retrieval

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

Outline of the course

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Recap: lecture 2 CS276A Information Retrieval

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Chapter 27 Introduction to Information Retrieval and Web Search

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Multimedia Information Systems

Authoritative K-Means for Clustering of Web Search Results

Administrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks

60-538: Information Retrieval

Boolean Queries. Keywords combined with Boolean operators:

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

Introduction to Information Retrieval

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Introduction to Information Retrieval

Advanced Retrieval Information Analysis Boolean Retrieval

Information Retrieval. hussein suleman uct cs

Representation of Documents and Infomation Retrieval

Introduction to Information Retrieval

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Index construction CE-324: Modern Information Retrieval Sharif University of Technology

CS347. Lecture 2 April 9, Prabhakar Raghavan

Information Retrieval

Information Retrieval

Information Retrieval CSCI

Reading group on Ontologies and NLP:

Today s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan

Glossary. ASCII: Standard binary codes to represent occidental characters in one byte.

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

UNICAL, 21/10/2004. Tutorial goals

Introduction to Information Retrieval

Modern Information Retrieval

Introduction to Information Retrieval

Information Retrieval (Part 1)

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

CSCI 5417 Information Retrieval Systems Jim Martin!

Information Retrieval and Text Mining

Web Information Retrieval using WordNet

Introduction to Information Retrieval

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

Chapter 2. Architecture of a Search Engine

Information Retrieval

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Part 2: Boolean Retrieval Francesco Ricci

Information Retrieval and Search Engine

Digital Libraries: Language Technologies

Information Retrieval

INFORMATION RETRIEVAL SYSTEM USING FUZZY SET THEORY - THE BASIC CONCEPT

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

COMP6237 Data Mining Searching and Ranking

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Unstructured Data. CS102 Winter 2019

Chapter 3 - Text. Management and Retrieval

Information Retrieval

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Information Technology for Documentary Data Representation

Transcription:

Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford)

Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent and perhaps most widely used IR application Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

Typical IR System Architecture Document corpus Query String IR System Ranked Documents 1. Doc1 2. Doc2 3. Doc3..

Web Search System Web Spider Document corpus Query String IR System 1. Page1 2. Page2 3. Page3.. Ranked Documents

Relevance Relevance is a subjective judgment and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need) Main relevance criterion: an IR system should fulfill the user s information need

Basic IR Approach: Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords May not retrieve relevant documents that include synonymous terms. restaurant vs. café PRC vs. China May retrieve irrelevant documents that include ambiguous terms. bat (baseball vs. mammal) Apple (company vs. fruit) bit (unit of data vs. act of eating) In this course: We will cover the basics of keyword-based IR Also address more complex techniques for intelligent IR

Techniques for Intelligent IR Take into account the meaning of the words used Take into account the order of words in the query Adapt to the user based on automatic or semi-automatic feedback Extend search with related terms Perform automatic spell checking / diacritics restoration Take into account the authority of the source.

IR System Architecture User Need User Feedback Query Operations User Interface Text Operations Logical View Indexing Text Database Manager Query Ranked Docs Searching Ranking Index Retrieved Docs Inverted file Text Database

IR System Components Text Operations forms index words (tokens). Tokenization Stopword removal Stemming Indexing constructs an inverted index of word to document pointers. Mapping from keywords to document ids Doc 1 I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me. Doc 2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious Term Doc # I 1 did 1 enact 1 julius 1 caesar 1 I 1 was 1 killed 1 i' 1 the 1 capitol 1 brutus 1 killed 1 me 1 so 2 let 2 it 2 be 2 with 2 caesar 2 the 2 noble 2 brutus 2 hath 2 told 2 you 2 caesar 2 was 2 ambitious 2

IR System Components Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric. User Interface manages interaction with the user: Query input and document output Relevance feedback Visualization of results Query Operations transform the query to improve retrieval: Query expansion using a thesaurus Query transformation using relevance feedback

IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Browsing boolean vector probabilistic Structured Models Non-Overlapping Lists Proximal Nodes Algebraic Generalized Vector Lat. Semantic Index Neural Networks Probabilistic Inference Network Belief Network Browsing Flat Structure Guided Hypertext

Classic IR Models - Basic Concepts Each document is represented by a set of representative keywords or index terms An index term is a document word that may be searched for Index terms may be selected to be only nouns, since nouns have meaning by themselves Should reduce the size of the index... But it requires the identification of nouns à Part of Speech tagger However, search engines assume that all words are index terms (full text representation)

Classic IR Models - Basic Concepts Not all terms are equally useful for representing the document contents: less frequent terms allow for the identification of a narrower set of documents The importance of the index terms is represented by weights associated to them Let k i be an index term d j be a document w ij is a weight associated with (k i,d j ) The weight w ij quantifies the importance of the index term to describe the document contents

Boolean Model Simple model based on set theory Queries specified as boolean expressions precise semantics neat formalism q = k a (k b k c ) Terms are either present or absent. Thus, w ij ε {0,1} Can always be transformed into DNF q dnf = (1,1,1) (1,1,0) (1,0,0)

Boolean Model q = k a (k b k c ) k a (1,0,0) (1,1,0) (1,1,1) k b k c

Drawbacks of the Boolean Model Retrieval based on binary decision criteria with no notion of partial matching No ranking of the documents is provided (absence of a grading scale) Information need has to be translated into a Boolean expression which most users find awkward The Boolean queries formulated by the users are most often too simplistic As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Vector-based Model Use of binary weights is too limiting Non-binary weights provide consideration for partial matches These term weights are used to compute a degree of similarity between a query and each document Ranked set of documents provides for better matching

Vector-based Model Define: w ij > 0 whenever k i d j w iq >= 0 associated with the pair (k i,q) vec(d j ) = (w 1j, w 2j,..., w tj ) vec(q) = (w 1q, w 2q,..., w tq ) Assign to each term k i a vector vec(i) The vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) The t vectors vec(i) form an orthonormal basis for a t- dimensional space In this space, queries and documents are represented as weighted vectors

Vector-based Model j dj Θ q Sim(q,dj) = cos(θ) = [vec(d j ) vec(q)] / d j * q = [Σ w ij * w iq ] / d j * q Since w ij > 0 and w iq > 0, 0 <= sim(q,d j ) <=1 i A document is retrieved even if it matches the query terms only partially

Vector-based Model Sim(q,d j ) = [Σ w ij * w iq ] / d j * q How to compute the weights w ij and w iq? A good weight must take into account two effects: quantification of intra-document contents tf factor, the term frequency within a document quantification of inter-documents separation idf factor, the inverse document frequency w ij = tf(i,j) * idf(i)

Probabilistic Model Objective: to capture the IR problem using a probabilistic framework Given a user query, there is an ideal answer set Guess at the beginning what that could be (i.e., guess initial description of ideal answer set) Improve by iteration

Probabilistic Model An initial set of documents is retrieved somehow Can be done using vector-space model, boolean model User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected) IR system uses this information to refine description of ideal answer set By repeting this process, it is expected that the description of the ideal answer set will improve Have always in mind the need to guess at the very beginning the description of the ideal answer set Description of ideal answer set is modeled in probabilistic terms

Probabilistic Ranking Principle Given a user query q and a document d j, the probabilistic model tries to estimate the probability that the user will find the document d j interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only.

Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve COLLECTION: a set of documents INDEX: a representation of information that makes querying easier TERM: word or concept that appears in a document or a query

Other Important Terms Classification Cluster Similarity Information Extraction Term Frequency Inverse Document Frequency Precision Recall Inverted File Query Expansion Relevance Relevance Feedback Stemming Stopword Vector Space Model Weighting TREC/TIPSTER/MUC