Information Retrieval Overview

Vagelis Hristidis
School of Computer Science, Florida International University
COP 6727, 9/14/2004

Roadmap
- What is IR?
- Matching Models
- Evaluation of Results
- Digital Libraries vs. IR
- Bridging IR + Databases
- Proximity Search in Databases [Goldman et al.]

What IR Systems Try to Do
- Predict, on the basis of some information about the user, and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with, at any particular time

How IR Systems Try to Do This
- Represent the user's information problem (the query)
- Represent (surrogate) and organize (classify) the contents of the knowledge resource
- Compare query to surrogates (predict relevance)
- Present results to the user for interaction/judgment

Why IR is Difficult
- People cannot specify what they don't know (Anomalous State of Knowledge), so representation of the information problem is inherently uncertain
- Information objects can be about many things, so representation of aboutness is inherently incomplete

Document & Query
- Document side: data -> generate -> document -> transform -> internal representation -> match
- Query side: information need -> generate -> query -> transform -> internal representation -> match
- Various structures have been proposed and used for queries to a retrieval system

Documents
- The document and the query undergo parallel processes within the retrieval system.
- [Figure: data -> gather -> document -> transform -> internal representation -> transform -> format to use in matching process; information need -> generate -> query -> transform -> internal representation -> transform -> format to use in matching process. Both paths meet in the Matching Process. The diagram labels an ectosystem region and an endosystem region.]

Matching Criteria
- An exact match can only be found in special situations
  - requires precise query
  - numerical or business applications
- Range Match
  - works best in a DB with defined fields
- Approximate Matching
- Matching techniques can be combined, e.g., begin with approximate, narrow down with exact or range

Boolean Queries
- Based on concepts from logic: AND, OR, NOT
- Order of operations (two conventions)
  - NOT, AND, OR
  - left to right
- Standard forms
  - Disjunctive Normal Form (DNF): Terms, Conjuncts, Disjuncts
    (P AND Q) OR (Q AND NOT R) OR (P AND R)
  - Conjunctive Normal Form (CNF): Terms, Disjuncts, Conjuncts
    P AND (NOT Q OR R) AND (S OR NOT R)

Truth Table

  P  Q  NOT P  P AND Q  P OR Q
  0  0  TRUE   FALSE    FALSE
  0  1  TRUE   FALSE    TRUE
  1  0  FALSE  FALSE    TRUE
  1  1  FALSE  TRUE     TRUE

Boolean-based Matching
- Separate the documents containing a given term from those that do not.
- No similarity between document and query structure
- Proximity Judgement: Gradations of the retrieved set
- Terms: adventure, agriculture, bridge, cathedrals, disasters, flags, horticulture, leprosy, Mediterranean, recipes, scholarships, tennis, Venus
- [Figure: 0/1 incidence matrix over the 13 terms above]
- Queries
  - flags AND tennis
  - leprosy AND tennis
  - Venus OR (tennis AND flags)
  - (bridge OR flags) AND tennis
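Boolean-based matching as described above can be sketched with set operations over an inverted index. This is a minimal illustration, not the lecture's implementation: the four documents `d1` to `d4` and their term sets are hypothetical, and the helper names (`build_index`, `term`, `AND`, `OR`, `NOT`) are introduced here only for the example. The four queries are the ones listed on the slide.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"flags", "tennis", "recipes"},
    "d2": {"leprosy", "bridge"},
    "d3": {"venus", "cathedrals"},
    "d4": {"bridge", "tennis"},
}


def build_index(collection):
    """Inverted index: map each term to the set of documents containing it."""
    index = {}
    for doc_id, terms in collection.items():
        for t in terms:
            index.setdefault(t, set()).add(doc_id)
    return index


index = build_index(docs)
all_docs = set(docs)


def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return index.get(t, set())


# Each Boolean operator becomes a set operation on document sets.
def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def NOT(a):
    return all_docs - a


q1 = AND(term("flags"), term("tennis"))
q2 = AND(term("leprosy"), term("tennis"))
q3 = OR(term("venus"), AND(term("tennis"), term("flags")))
q4 = AND(OR(term("bridge"), term("flags")), term("tennis"))

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4)]:
    print(name, sorted(result))
```

Every document is either in the result set or not, which is why this style of matching gives no ranking among the retrieved documents.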

Exact Match IR
- Advantages
  - Efficient
  - Boolean queries capture some aspects of information problem structure
- Disadvantages
  - Not effective
  - Difficult to write effective queries
  - No inherent document ranking

Vector Queries
- Documents and Queries are vectors of terms
- Actual vectors have many terms (thousands)
- Vectors can be Boolean (keyword) or weighted (term frequencies)
- Example terms: dog, cat, house, sink, road, car
  - Boolean: (1,1,0,0,0,0), (0,0,1,1,0,0)
  - Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)
- Queries can be weighted also

Vector-based Matching: Metrics
- Metric or Distance Measure: documents close together in the vector space are likely to be highly similar
- Axes represent terms
- Documents are represented as points in the vector space
- Similarities among documents
- Similarities between documents and queries
- [Figure: two-dimensional vector space with axes Term 1 ("internet") and Term 2 ("weather")]

Vector-based Matching: Cosine
- Cosine of the angle between the vectors representing the document and the query
- Documents in the same direction are closely related.
- Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowest
- [Figure: vectors labeled A, B, C, D]

Example, continued
- Vocabulary (in order): a, and, cat, dog, frog
- Document A: "A dog and a cat."
  - a 2, and 1, cat 1, dog 1, frog 0
  - Vector: (2,1,1,1,0)
- Document B: "A frog."
  - a 1, and 0, cat 0, dog 0, frog 1
  - Vector: (1,0,0,0,1)

Queries
- Queries can be represented as vectors in the same way as documents:
  - Dog = (0,0,0,1,0)
  - Frog = (0,0,0,0,1)
  - Dog and frog = (0,1,0,1,1)
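The vector representation and cosine matching described above can be sketched as follows. The tokenizer (lowercase alphabetic runs) and the function names `tokenize`, `to_vector`, and `cosine` are assumptions made for this illustration; the vocabulary order and the two documents are taken from the slide example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def to_vector(text, vocabulary):
    """Raw term-frequency vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocabulary]


def cosine(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocabulary = ["a", "and", "cat", "dog", "frog"]
doc_a = to_vector("A dog and a cat.", vocabulary)
doc_b = to_vector("A frog.", vocabulary)
query = to_vector("dog", vocabulary)

# Rank the documents by similarity to the query, most similar first.
ranking = sorted(
    [("A", cosine(doc_a, query)), ("B", cosine(doc_b, query))],
    key=lambda pair: pair[1],
    reverse=True,
)
print(doc_a, doc_b, query)
print(ranking)
```

Because documents and queries share one representation, the same `cosine` function also compares two documents with each other.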

Similarity measures
- There are many different ways to measure how similar two documents are, or how similar a document is to a query
- The cosine measure is a very common similarity measure
- Using a similarity measure, a set of documents can be compared to a query and the most similar document returned

The cosine measure
- For two vectors $d$ and $d'$ the cosine similarity between $d$ and $d'$ is given by:

  $$\cos(d, d') = \frac{d \cdot d'}{|d|\,|d'|}$$

- Here $d \cdot d'$ is the inner (dot) product of $d$ and $d'$, calculated by multiplying corresponding frequencies together and summing the results
- The cosine measure calculates the cosine of the angle between the vectors in a high-dimensional virtual space

Example
- Let $d = (2,1,1,1,0)$ and $d' = (0,0,0,1,0)$
  - $d \cdot d' = 2 \times 0 + 1 \times 0 + 1 \times 0 + 1 \times 1 + 0 \times 0 = 1$
  - $|d| = \sqrt{2^2 + 1^2 + 1^2 + 1^2 + 0^2} = \sqrt{7} = 2.646$
  - $|d'| = \sqrt{0^2 + 0^2 + 0^2 + 1^2 + 0^2} = \sqrt{1} = 1$
  - Similarity $= 1 / (1 \times 2.646) = 0.378$
- Let $d = (1,0,0,0,1)$ and $d' = (0,0,0,1,0)$
  - Similarity $= 0$

Vector Space Model
- Advantages
  - Straightforward ranking
  - Simple query formulation (bag of words)
  - Intuitively appealing
  - Effective
- Disadvantages
  - Unstructured queries
  - Effective calculations and parameters must be empirically determined

Fuzzy Queries
- Fuzzy Logic: Propositions have a truth value between 0 and 1
  - Fuzzy NOT: $1 - t$
  - Fuzzy AND: $t_1 t_2$
  - Fuzzy OR: $1 - (1 - t_1)(1 - t_2)$
- Example:
  - All swans are white: 0.8
  - All swans can swim: 0.9
  - White and swim: 0.8 * 0.9 = 0.72
  - White or swim: 1 - (0.2 * 0.1) = 0.98

Probabilistic Queries
- Like fuzzy queries, except they adhere to the laws of probability
- Can use probabilistic concepts like Bayes' Theorem
- Term frequency data can be used to estimate probabilities
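The fuzzy operators from the Fuzzy Queries slide can be written directly as small functions. This is only a sketch of the product-based rules stated on the slide; the function names are introduced here for illustration. Under these rules, "white or swim" with truth values 0.8 and 0.9 evaluates to 1 - (0.2 * 0.1) = 0.98.

```python
import math


def fuzzy_not(t):
    """Fuzzy NOT: 1 - t."""
    return 1.0 - t


def fuzzy_and(t1, t2):
    """Fuzzy AND: product of the two truth values."""
    return t1 * t2


def fuzzy_or(t1, t2):
    """Fuzzy OR: 1 - (1 - t1) * (1 - t2)."""
    return 1.0 - (1.0 - t1) * (1.0 - t2)


white = 0.8  # truth value of "all swans are white"
swim = 0.9   # truth value of "all swans can swim"

white_and_swim = fuzzy_and(white, swim)
white_or_swim = fuzzy_or(white, swim)

print(round(white_and_swim, 2), round(white_or_swim, 2))

# With these definitions OR is the De Morgan dual of AND:
# a OR b == NOT(NOT a AND NOT b).
dual = fuzzy_not(fuzzy_and(fuzzy_not(white), fuzzy_not(swim)))
print(math.isclose(white_or_swim, dual))
```

When the truth values are restricted to 0 and 1, the three functions reduce to the ordinary Boolean truth table.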

Natural Language Queries
- The Holy Grail of information retrieval
- Issues in Natural Language Processing
  - syntax
  - semantics
  - pragmatics
  - speech understanding
  - speech generation

Vocabulary
- Stopword lists
  - Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
  - Stopword lists contain frequent words to be excluded
  - Stopword lists need to be used carefully, e.g. "to be or not to be"

Term weighting
- Not all words are equally useful
- A word is most likely to be highly relevant to document A if:
  - it is infrequent in other documents
  - it is frequent in document A
  - document A is short
- The cosine measure needs to be modified to reflect this

Normalised term frequency (tf)
- A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document
- This is known as the tf factor.
- Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)
- This stops large documents from scoring higher

Inverse document frequency (idf)
- A calculation designed to make rare words more important than common words
- The idf of word $i$ is given by

  $$\mathrm{idf}_i = \log \frac{N}{n_i}$$

- where $N$ is the number of documents and $n_i$ is the number that contain word $i$

tf-idf
- The tf-idf weighting scheme is to multiply each word in each document by its tf factor and idf factor
- Different schemes are usually used for query vectors
- Different variants of tf-idf are also used
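The tf, idf, and tf-idf definitions above can be combined in a short sketch. Two choices here are assumptions rather than something the slides specify: the logarithm is taken as the natural log, and a term that occurs in no document is given an idf of 0 to avoid dividing by zero. The function names are illustrative; the two raw frequency vectors are Documents A and B from the earlier example.

```python
import math


def tf_vector(raw_counts):
    """Normalised tf: each count divided by the largest count in the document."""
    max_count = max(raw_counts)
    if max_count == 0:
        return [0.0 for _ in raw_counts]
    return [c / max_count for c in raw_counts]


def idf_vector(doc_vectors):
    """idf_i = log(N / n_i), using the natural log (an assumption)."""
    n_docs = len(doc_vectors)
    n_terms = len(doc_vectors[0])
    idf = []
    for i in range(n_terms):
        n_i = sum(1 for d in doc_vectors if d[i] > 0)
        # Assumption: a term found in no document gets weight 0.
        idf.append(math.log(n_docs / n_i) if n_i else 0.0)
    return idf


def tf_idf(doc_vectors):
    """Multiply each document's tf factors by the collection-wide idf factors."""
    idf = idf_vector(doc_vectors)
    return [[tf * w for tf, w in zip(tf_vector(d), idf)] for d in doc_vectors]


vocabulary = ["a", "and", "cat", "dog", "frog"]
doc_a = [2, 1, 1, 1, 0]  # "A dog and a cat."
doc_b = [1, 0, 0, 0, 1]  # "A frog."

weights = tf_idf([doc_a, doc_b])
for name, row in zip(["A", "B"], weights):
    print(name, [round(w, 3) for w in row])
```

The word "a" appears in both documents, so its idf is log(2/2) = 0 and it drops out of both weighted vectors, which is the intended effect of making common words less important.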

Document Scoring Functions

Missing Terms & Term Relationships
- Vector space model problem: the value 0 is used in 2 ways
  - to indicate terms that are truly missing
  - to indicate terms about which there is no information
- Problem: relationships among the terms in a document or query
  - Linearly independent set of basis vectors

Probabilistic Matching
- Given a document and a query, it should be possible to calculate the probability that the document is relevant to the query.
- Assumption: the number of documents within the database that are relevant to the query is known.
- Discriminant function (dis(selected) > 1)
- Much calculation and many assumptions
- Good results, but not better than those obtained using the Boolean or vector model

Probabilistic IR
- Advantages
  - Straightforward relevance ranking
  - Simple query formulation
  - Sound mathematical/theoretical model
  - Effective
- Disadvantages
  - Unrealistic assumptions (term independence)
  - Probabilities difficult to estimate

Fuzzy Matching
- Replaces the need to estimate probabilities by a need to estimate a sense of belief about a document's relevance.
- Fuzzy matching: calculation based on defined membership grades for terms
  - how well a related term matches a given term
- Modifiers or descriptors
- Problem: how such terms translate into the membership function associated with fuzzy retrieval

Proximity Matching
- Proximity Criteria
  - Additional criteria to further refine the set of documents identified by one of the other matching methods.
- Modification of proximity criteria
  - use phrases rather than simple word proximity
  - ordered proximity to aid in the retrieval decision
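The proximity criteria above can be illustrated with a word-position index. This is a hedged sketch, not a method taken from the slides: the function names, the `max_gap` parameter (distance counted in word positions), and the sample sentence are all assumptions introduced for the example. An ordered check with `max_gap=1` behaves like a two-word phrase match.

```python
import re


def positions(text):
    """Map each lowercase word to the list of positions at which it occurs."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index.setdefault(word, []).append(pos)
    return index


def within(text, first, second, max_gap, ordered=False):
    """True if `first` and `second` occur at most `max_gap` positions apart.

    With ordered=True, `first` must appear before `second`.
    """
    index = positions(text)
    for p in index.get(first, []):
        for q in index.get(second, []):
            gap = q - p
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_gap:
                return True
    return False


text = ("Information retrieval systems rank documents; "
        "database systems retrieve information")

# Unordered proximity: "database" and "information" are 3 positions apart.
print(within(text, "database", "information", 3))
# Ordered proximity with gap 1 acts as a phrase match.
print(within(text, "information", "retrieval", 1, ordered=True))
print(within(text, "retrieval", "information", 1, ordered=True))
```

In practice such a check would be applied only to the documents already selected by a Boolean or vector match, which is the refinement role the slide describes.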

Evaluation of IR Systems
- Traditional goal of IR is to retrieve all and only the relevant documents in response to a query
- "All" is measured by recall: the proportion of relevant documents in the collection which are retrieved
- "Only" is measured by precision: the proportion of retrieved documents which are relevant

Evaluation Problems
- Realistic IR is interactive; traditional IR methods and measures are based on noninteractive situations
- Evaluating interactive IR requires human subjects; the normal mode of evaluation is comparison between two systems (no gold standard or benchmarks); cannot compare a subject's searching on the same task in two systems

How Interaction Has Been Accounted For
- Relevance feedback
  - Automatically moving the initial query toward the ideal query
  - Term reweighting and query expansion
- Support for query modification
  - Display of good and bad terms
  - Thesauri
  - Inter-document relations

What is a Digital Library (DL)?
- "a collection of information that is both digitized and organized" (Lesk, p. 1)
- there are any number of alternate definitions, but this seems fair enough
- no mention of architecture, implementation, content, etc.
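The definitions of recall and precision given under Evaluation of IR Systems translate directly into set arithmetic. The sketch below is illustrative only: the function name and the document identifiers are hypothetical, and returning 0.0 for an empty retrieved or relevant set is a convention chosen here, not something the slides state.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Empty denominators yield 0.0 (a convention assumed for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4), which shows how a system can do reasonably on "only" while falling short on "all".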

How is a DL different from a database?
- A traditional SQL database has as its basic element data items in a relation:
    select name from employee, project
    where employee.deptnumber = 25 AND project.number = 100
- databases exploit known structures and relations
- DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)

How is a DL different from traditional IR systems?
- The difference is less clear
- IR systems can be considered the precursors to DLs
- The basic unit of an IR system is a document and the focus is on textual retrieval
  - exact matching: Boolean, text pattern searching
  - inexact matching: probabilistic, vector space, clustering

How is a DL different from the WWW?
- The key difference is organization
- The WWW as a whole has no real organization
- Recently, convergence as search engines (Google) attempt to add an organizational framework to their web holdings
- In the past, most were focused on keyword searching (e.g., AltaVista)

How is a DL different from the WWW? (continued)
- Another key difference is who controls the input into the system
  - most meta searchers hunt down their holdings
    - Lycos is short for Lycosidae lycosa (the "wolf spider"), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97)
  - some (Yahoo) have humans in the loop for review and classification
- To date, DLs are generally more tightly controlled, and have a targeted customer set

DL = Content + Services
- digital library = "collection of information both digitized and organized" -- M. Lesk, 1997
- DL is the union of the content and services defined on the content
- [Figure: layered diagram. Access: WWW (http) access and non-WWW access, with the labels "most common" and "now uncommon". Services: vector and/or Boolean search engines (traditional IR); digital library services (searching, browsing, citation analysis, usage analysis, alerts). Content: RDBMS, file systems, other technologies.]

Database Search vs. Information Retrieval (Document Search)
- Database Search
  - Data Stored in Structured Data Types (Tables, XML) and Conform to Schema

      id  date    name        id  state
      4   5/4/02  Abc         4   CA
      6   5/6/07  cde         6   NY

  - Powerful, Structured Query Languages (SQL, XQuery)
      SELECT orderid
      FROM customers C, orders O, suppliers S
      WHERE C.custname='Miller' AND S.supplname='Smith'
        AND C.custid=O.custid AND O.suppid=S.suppid
  - Answer: Unordered Set of Rows, XML Elements, e.g. {o51, o24}
- Information Retrieval
  - Data is Set of Documents
  - Keyword Expressions: Smith AND Miller
  - Answer: Ranked List of Documents: Doc1 0.9, Doc3 0.7, Doc2 0.5

Bridging the Gap Between Databases and Information Retrieval
- [Figure: data storage axis running from Relational Databases through Databases+Text/XML to the Web, shown against Information Retrieval. Keyword Proximity Search and Authority-Based Search are placed between the two areas. Sample tables (id, date, state, name, descr) illustrate the database end; linked pages P1 to P4 illustrate the Web end.]

Goal
- Currently, Information Discovery in Databases Requires:
  - Knowledge of the Role of the Keywords
  - Knowledge of Schema
  - Knowledge of a Query Language
- Example: keywords Miller, Smith
    1. SELECT * FROM customers,orders,suppliers WHERE custname='Miller' AND supplname='Smith' AND ...
    2. SELECT * FROM customers,orders,suppliers WHERE custname='Smith' AND supplname='Miller' AND ...
    3. ...
- Goal: Enable IR-like Keyword Search over Databases Without the Above Requirements
- [Figure: Query Type axis running from Structured Queries (SQL, XQuery) through Semistructured Queries to Keyword Search]

Database Graph
- Application Specific
- Node can be attribute value, tuple, group of tuples
- Edges are semantic connections
  - Primary to foreign keys
  - XML edges

Proximity Search in Databases [VLDB1998]
- Find Set, Near Set
- E.g., "Find Movie near Travolta Cage"
  - Near set = {Travolta, Cage}
  - Find set = ?
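The database graph idea (tuples as nodes, primary-to-foreign-key references as edges) can be sketched on the customers/orders/suppliers example. Everything concrete below is hypothetical: the tuple keys, the extra order `o77`, and the helper names are invented for illustration, and exact string equality stands in for keyword matching. The point of the sketch is that a keyword search over the graph needs no knowledge of which role each keyword plays.

```python
from collections import defaultdict

# Hypothetical tuples, keyed by primary key.
customers = {"c1": {"custname": "Miller"}, "c2": {"custname": "Smith"}}
suppliers = {"s1": {"supplname": "Smith"}, "s2": {"supplname": "Miller"}}
orders = {
    "o51": {"custid": "c1", "suppid": "s1"},
    "o24": {"custid": "c1", "suppid": "s1"},
    "o77": {"custid": "c2", "suppid": "s2"},
}

# Nodes are tuples; each foreign-key reference becomes an undirected edge.
graph = defaultdict(set)


def add_edge(u, v):
    graph[u].add(v)
    graph[v].add(u)


for order_id, row in orders.items():
    add_edge(order_id, row["custid"])
    add_edge(order_id, row["suppid"])

# Text attached to each customer or supplier node.
names = {key: row["custname"] for key, row in customers.items()}
names.update({key: row["supplname"] for key, row in suppliers.items()})


def nodes_matching(keyword):
    """Tuples whose name equals the keyword, whatever their role."""
    return {node for node, name in names.items() if name == keyword}


def connecting_orders(kw1, kw2):
    """Orders linked to a tuple matching kw1 and to a tuple matching kw2."""
    first, second = nodes_matching(kw1), nodes_matching(kw2)
    return {
        order_id
        for order_id in orders
        if graph[order_id] & first and graph[order_id] & second
    }


print(sorted(nodes_matching("Miller")))
print(sorted(connecting_orders("Miller", "Smith")))
```

With this toy data the keyword pair returns o51 and o24 (customer Miller, supplier Smith) and also o77 (customer Smith, supplier Miller), covering both role assignments that would otherwise need separate SQL queries.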

Slide notes

Slide 49
- VH4: ... which require that we specify the types of connections between keywords. (Vagelis, 2/14/2004)
- VH5: SQL has logical underpinnings, which require that each tuple is either in or out of the result. (Vagelis, 2/13/2004)

Slide 50
- VH6: Databases and IR have followed distinct research paths, since they were considered fundamentally different. However, in recent years we have witnessed a convergence between the two areas. From the rigidly structured relational databases with atomic attributes, we moved to databases with plain text attributes and to semistructured data like XML, which combine plain text with structured elements. On the other end, from IR we moved to Web search, where there are links between the documents (pages). However, the information discovery techniques have remained separated. In an effort to bridge this gap, we carry two well-studied ideas of IR and Web search to databases. In particular, keyword proximity search discovers connections between keywords, and authority-based search adapts the idea of PageRank to databases. (Vagelis, 2/22/2004)

Slide 51
- VH7: Let's see what the requirements are for searching a database. Suppose we want to discover how Smith and Miller are associated. To do so using traditional discovery techniques we need to know the role of the keywords (whether Smith is a customer or a supplier, etc.), the schema (e.g., customers are connected with the suppliers through the orders relation), and a query language (SQL). If we don't know how the keywords are connected, we would have to write a large number of SQL queries like... Our work enables... (Vagelis, 2/22/2004)
- VH8: Structured and keyword queries are the two extremes of the query type axis. In the middle lie semistructured queries, which are part of my future work and will not be discussed in this presentation. (Vagelis, 2/22/2004)

Example Movie Database
- [Figure: example movie database graph, continued over three slides]

Ranking Function
- Bond between nodes $f$, $n$:

  $$b(f, n) = \frac{r_F(f)\, r_N(n)}{d(f, n)^t}$$

  - $r_N$, $r_F$: ranking in set $N$, $F$
  - $d$: distance
- $\mathrm{Score}(f) = \sum_{n \in N} b(f, n)$
  - Or $\max(\ldots)$

Execution
- Dijkstra
  - Efficient for in memory only
- Precompute all paths (Floyd-Warshall)
  - Inefficient in time and space
  - No way to prune for distance > K
- Present algorithm to compute all-pairs distances efficient for graphs stored on disks
  - Still too much space!

Hub indexing
- Hub index consists of:
  - Hub set H and shortest distances between them
  - Distances between pairs of objects not crossing through H
- Algorithm to efficiently answer query using the hub index
- Hub set is nodes with highest degree (heuristic)
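The ranking function above can be tried on a toy graph. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the small movie graph is a hypothetical stand-in for the lecture's example database, all edges have length 1 so a breadth-first search replaces Dijkstra, the rankings r_F and r_N default to 1, the exponent t defaults to 2, and near-set nodes at distance 0 or unreachable nodes are skipped. None of the hub-indexing machinery is modelled.

```python
from collections import deque

# Hypothetical toy graph: actors linked to the movies they appear in.
graph = {
    "Travolta": ["Face/Off", "Grease"],
    "Cage": ["Face/Off", "Con Air"],
    "Face/Off": ["Travolta", "Cage"],
    "Grease": ["Travolta"],
    "Con Air": ["Cage"],
}


def distances_from(source, graph):
    """Shortest hop counts from `source` (BFS; assumes unit-length edges)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def score(find_set, near_set, graph, t=2.0, r_find=None, r_near=None):
    """Score(f) = sum over n in near_set of r_F(f) * r_N(n) / d(f, n)**t."""
    r_find = r_find or {}
    r_near = r_near or {}
    dist_from_near = {n: distances_from(n, graph) for n in near_set}
    scores = {}
    for f in find_set:
        total = 0.0
        for n in near_set:
            d = dist_from_near[n].get(f)
            if d is None or d == 0:
                continue  # assumption: skip unreachable or identical nodes
            total += r_find.get(f, 1.0) * r_near.get(n, 1.0) / d ** t
        scores[f] = total
    return scores


find_set = ["Face/Off", "Grease", "Con Air"]
near_set = ["Travolta", "Cage"]

scores = score(find_set, near_set, graph)
for movie, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{movie}: {value:.3f}")
```

The movie adjacent to both near-set nodes collects two bonds of strength 1, while the others get one bond of 1 and one of 1/9, so it ranks first. Replacing the sum with a maximum, as the slide also suggests, would instead reward the single closest connection.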

Some References
- Some slides have been taken from:
  - Michael L. Nelson, ODU
  - Paul Minro
  - Nicholas J. Belkin, Rutgers
  - Peter Burden
- Goldman et al. Proximity Search in Databases. VLDB 1998