INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model. Final Group Projects

Similar documents
Multimedia Information Retrieval

Introduction to Information Retrieval

60-538: Information Retrieval

Chapter 2: Universal Building Blocks. CS105: Great Insights in Computer Science

Information Retrieval and Web Search

CSE 20 DISCRETE MATH. Fall

INFSCI 2140 Information Storage and Retrieval Lecture 6: Taking User into Account. Ad-hoc IR in text-oriented DS

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

CSE 20 DISCRETE MATH. Winter

Information Retrieval and Web Search

Computer Organization and Levels of Abstraction

Information Retrieval. Information Retrieval and Web Search

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Information Retrieval and Web Search Engines

COMP combinational logic 1 Jan. 18, 2016

Information Retrieval and Web Search Engines

INFSCI 2140 Information Storage and Retrieval Lecture 1: Introduction. INFSCI 2140 and your program

CHAPTER 5 Querying of the Information Retrieval System

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Introduction to Computer Architecture

[Ch 6] Set Theory. 1. Basic Concepts and Definitions. 400 lecture note #4. 1) Basics

Information Retrieval

Object-Oriented Software Engineering Practical Software Development using UML and Java

Propositional Calculus. Math Foundations of Computer Science

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

modern database systems lecture 4 : information retrieval

Computer Organization and Levels of Abstraction

Information Retrieval and Knowledge Organisation

Unstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)

Discrete structures - CS Fall 2017 Questions for chapter 2.1 and 2.2

P -vs- NP. NP Problems. P = polynomial time. NP = non-deterministic polynomial time

Program Verification & Testing; Review of Propositional Logic

Chapter 27 Introduction to Information Retrieval and Web Search

8.3 Common Loop Patterns

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

CS Bootcamp Boolean Logic Autumn 2015 A B A B T T T T F F F T F F F F T T T T F T F T T F F F

Information Retrieval

Mathematical Logic Prof. Arindama Singh Department of Mathematics Indian Institute of Technology, Madras. Lecture - 9 Normal Forms

Some Hardness Proofs

CS40-S13: Functional Completeness

CS1 Lecture 22 Mar. 6, 2019

Formal Methods of Software Design, Eric Hehner, segment 1 page 1 out of 5

Definitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy

Overview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

CS54701: Information Retrieval

HOW DOES A SEARCH ENGINE WORK?

GC03 Boolean Algebra

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

BOOLEAN ALGEBRA AND CIRCUITS

A Security Model for Multi-User File System Search. in Multi-User Environments

Search Engine Architecture. Hongning Wang

Indexing and Query Processing. What will we cover?

Information Science 1

INFSCI 2480 Adaptive Information Systems Adaptive [Web] Search. Peter Brusilovsky.

Search engines. Børge Svingen Chief Technology Officer, Open AdExchange

Boolean Functions (Formulas) and Propositional Logic

An Evolution of Mathematical Tools

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

To prove something about all Boolean expressions, we will need the following induction principle: Axiom 7.1 (Induction over Boolean expressions):

School of Computer Science

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Mixed Integer Linear Programming

LECTURE 4. Logic Design

Information Retrieval

CS47300: Web Information Search and Management

To prove something about all Boolean expressions, we will need the following induction principle: Axiom 7.1 (Induction over Boolean expressions):

COMS 1003 Fall Introduction to Computer Programming in C. Bits, Boolean Logic & Discrete Math. September 13 th

Chapter 6: Information Retrieval and Web Search. An introduction

THE WEB SEARCH ENGINE

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

(Refer Slide Time 6:48)

Lecture 5. Logic I. Statement Logic

The Information Retrieval Series. Series Editor W. Bruce Croft

Modern information retrieval

SAT Solver. CS 680 Formal Methods Jeremy Johnson

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

(Refer Slide Time 3:31)

CS105 Introduction to Information Retrieval

Boolean Algebra & Digital Logic

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 6 Outline. Unary Relational Operations: SELECT and

Object-Oriented Software Engineering Practical Software Development using UML and Java. Chapter 5: Modelling with Classes

Propositional Calculus: Boolean Algebra and Simplification. CS 270: Mathematical Foundations of Computer Science Jeremy Johnson

68A8 Multimedia DataBases Information Retrieval - Exercises

CS112 Lecture: Working with Numbers

Querying Introduction to Information Retrieval INF 141 Donald J. Patterson. Content adapted from Hinrich Schütze

Homework 1. Due Date: Wednesday 11/26/07 - at the beginning of the lecture

Henry Lin, Department of Electrical and Computer Engineering, California State University, Bakersfield Lecture 7 (Digital Logic) July 24 th, 2012

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Introduction to Asymptotic Running Time Analysis CMPSC 122

Information Retrieval

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Chapter 3 - Text. Management and Retrieval

Binary Values. CSE 410 Lecture 02

2. BOOLEAN ALGEBRA 2.1 INTRODUCTION

Information Retrieval

Fuzzy Logic. This amounts to the use of a characteristic function f for a set A, where f(a)=1 if the element belongs to A, otherwise it is 0;

Text Analytics (Text Mining)

CS112 Lecture: Variables, Expressions, Computation, Constants, Numeric Input-Output

Transcription:

INFSCI 2140 Information Storage and Retrieval Lecture 2: Models of Information Retrieval: Boolean model Peter Brusilovsky http://www2.sis.pitt.edu/~peterb/2140-051/ Final Group Projects Groups of variable size - match the project complexity Goal - learn more about IR, do some useful and practical work Educational IR system Exploring IR aspects Analysis 1

What are your strong sides? Are you good in: Reading and analyzing text Programming in C or Java Database programming Web programming Developing educational material Graphic arts and design Educational IR system Small educational programs: demo-simulation / assessment / ITS. Learn nuts and bolts of some complex aspect and help others to learn Here are some examples from the past semesters Boolean query Boolean document matching Evaluation measures Vectors, distances, similarities 2

Exploring IR aspects Any part of IR process to practice Document preparations and processing Simple search engine Similarity-based navigation Clustering Adaptive recommendation Possible domains C programming Information Retrieval Analysis Select a small field of interest on the frontier of IR, analyze a few relevant papers and systems, write an essay (indexed!), prepare a short talk for the last lecture Improving Web search Machine learning for IR Some type of information visualization Clustering Web recommenders... 3

Overview From an information need to documents Information Retrieval Models Boolean Model Boolean queries DNF and CNF Boolean matching From information need to documents Documents Surrogates Information need Query 4

Documents and information need IA Goal: user need documents that are relevant to information need Ways to find them depend on Is the information need explicit or implicit? Is it a one-time, iterative or continuous process? Do the documents form an unstructured pool or structured space? Explicit information need Explicit need is expressed in a query Searching and iterative searching Ad-hoc IR Information filtering and SDI Crawling and infobots Long-term information services (FAB) 5

Searching Documents Surrogates Information need Query Non-explicit information need Incremental search with relevance feedback Collaborative filtering Browsing Browsing / search combination Search/browsing - Golovchinsky Similarity navigation - Tudhope Area search while browsing - Lieberman 6

When / Why browsing? Hard to express an information need in a formal query Documents are connected to each other Humans aren t good in formulating requests, but good in recognizing Opportunity to navigate to similar documents Browsing Documents Surrogates Information need Query 7

Classic ad-hoc IR Documents form an unstructured document space Information need made explicit in a form of a query User searches once - one query against the document space (ad-hoc IR) The IR system returns an ordered subset of documents that are relevant to the query User s prospect Query a question in natural language a list of terms Matching The obvious exact match The range match Approximate match? ( tell me something about, tell me something that I don t know? ) 8

Problems with matching The fact that a document contains a term requested in a query doesn t mean that the document should be retrieved: we have to consider all the term and the condition expressed in the query it doesn t mean that the document is strongly related to the term a document can t be suitable for other reasons (age for example) Exact Match Examples all the documents authored by all the documents that have in the title the word It is possible with well structured documents and databases 9

Range Match Example all the documents authored from date1 to date2. Range match works with numerical databases Approximate match Necessary if the information need is not precise and if the database is not well organized. Sometimes it is necessary to combine the different kind or queries. 10

Designer s prospects for Q & M Can we consider query as a document? No let s consider a query as a characteristic function defined for document space Yes let s put a query into a space and calculate closeness or similarity In any case we need a measure for relevance ( closeness, similarity ) Document Space For our purposes we call document space the set of document representation Doc rep. Doc rep. Doc rep. Doc rep. representation Doc rep. Doc rep. 11

Document Space Document space is organized in some way. For example it could be possible to calculate a distance or a similarity between different documents Distance or similarity Doc rep. Doc rep. Doc rep. Doc rep. Doc rep. Doc rep. Models: Classic and New Boolean Model Classic Extended Fuzzy Vector Model Classic Others (generalized, LSI, Neural Networks) Probabilistic Model We will work by models, discussing Q&M for each model separately 12

The query is NOT a part of the document space The query defines an evaluation function: a function that have a value 0 or 1 depending if the document is relevant or is not relevant query Doc. representation {1 if relevant 0 if not relevant The query is NOT a part of the document space In more sophisticated systems the evaluation function can have values in the interval [0,1], allowing to rank documents. Doc. representation query 1 if relevant 0 if not relevant 13

Boolean Model Set Theoretic Approach Documents form a large set A query defines a subset (I.e., 0 or 1) Elementary query has a clearly defined subset To make a complex query one can use Boolean functions for set operation Boolean queries Boolean queries are based on Boolean Algebra Terms are join using connectives as AND, OR, NOT The search can be expanded using stemming, thesaurus or list of related terms Example: restaurants AND (mideastern OR vegetarian) AND inexpensive 14

Elementary Query Formatted fields Exact match (Year = 1999) Range match (Price < $50) Text fields Word inclusion (Programming in Title) Stemming/wildcards (Program$ in Title) Phrase inclusion ( Programming language in Title) Approximate match (sounds like John ) Boolean Operators Q1 AND Q2 Documents that are in BOTH sets: Q1 and Q2 Q1 OR Q2 Documents that are in at least in one set: Q1 or Q2 NOT Q1 All documents except the one in set Q1 15

Other Boolean Operators Q1 \ Q2 Logical minus all documents from Q1 except those that belong to Q2 Used also as binary NOT (Q1 NOT Q2) Q1 XOR Q2 Exclusive OR - documents that belong to exactly one set: Q1 or Q2, but not both In other words (Q1 OR Q2) \ (Q1 AND Q2) Some Special Operators Proximity (TNT) (Information within 1 word from Retrieval) This is really a special elementary query to structured text fields NOF (N of) 2 of (Sashimi, Sushi, Shabu-Shabu) This is a simple form that can be expressed by a regular query 16

Complex Queries We can use braces to determine the order of operation (NOT (Q1 OR Q2 OR Q3) AND (Q4 AND NOT Q5) OR Q6 Complex queries are hard to write, understand and perform Normalization - converting a query to one of the normal forms Normal Forms Every Boolean query can be recast into: Conjunctive Normal Form (CNF) Q1 AND Q2 AND Q3 AND Q4 Each Q is a disjunct Disjunctive Normal Form (DNF) Q1 OR Q2 OR Q3 OR Q4 Each Q is a conjunct 17

Disjunctive Normal Form (DNF) Term: positive or negated elementary fact Year = 1999; NOT (Director = Spielberg); Fire; NOT Retrieval Conjuncts: conjunction of terms Year = 1999 AND NOT (Director = Spielberg) Query: disjunction of conjuncts (a AND b) OR (c AND d) OR (NOT a AND d) Boolean queries: DNF The advantage is that the query can be split into many different queries. The results of the different queries is merged to produce the response to the original query: (a AND b) OR First query (c AND d) OR Second one (NOT a AND d) Third one 18

Conjunctive Normal Form (CNF) Term: positive or negated elementary fact Year = 1999; NOT (Director = Spielberg); Fire; NOT Retrieval Disjuncts: disjunction of terms Year = 1999 OR NOT (Director = Spielberg) Query: conjunction of disjuncts (a OR b) AND (c OR d) AND (NOT a OR d) Boolean queries: CNF Using a thesaurus can be easy to expand a CNF query to have a broader query: (a OR b) AND (c OR d) AND (NOT a OR d) List of terms List of terms List of terms 19

Normalization Convert each Boolean function to NFs Key: Truth Table Example: (A OR NOT B) AND C Normalization to DNF Consider T rows Normalization to DNF Consider F rows A B C not B Aor(notB) [Aor(notB)]and C 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 1 Example: Normalize the query [Aor(notB)] and C to DNF A B C not B Aor(notB) [Aor(notB)]and C 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 1 20

Example: Normalize the query [Aor(notB)] and C to DNF A B C not B Aor(notB) [Aor(notB)]and C 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 1 [(not A)and(notB)andC] or [A and (notb) and C] or [A and B and C] Normalization to CNF A way to obtain a CNF is to use the false (0) row of the table and simplifying them using DeMorgan s laws and double negation law DeMorgan s laws 1 st law: not(a and B) = (not A) or (not B) 2 nd law: not (A or B) = (not A) and (not B) double negation not(not A) = A 21

Example: Normalize the query [Aor(notB)] and C to CNF A B C not B Aor(notB) [Aor(notB)]and C 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 1 Example CNF Step 1: form the full DNF query using the false row of the table [(not A) and (not B) and (not C)] or [(not A) and B and (not C)] or [(not A) and B and C] or [A and (not B) and (not C)] or [A and B and (not C)] A B C not B Aor(notB) [Aor(notB)]and C 0 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 1 22

Example CNF step 2 : negate the whole DNF query not {[(not A) and (not B) and (not C)] or [(not A) and B and (not C)] or [(not A) and B and C] or [A and (not B) and (not C)] or [A and B and (not C)]} Example CNF step 3 : apply the 2 nd DeMorgan law not {[(not A) and (not B) and (not C)]} and not {[(not A) and B and (not C)] and not { [(not A) and B and C] } and not {[A and (not B) and (not C)] } and not {[A and B and (not C)]} 23

Example CNF step 4 : apply the 1 st DeMorgan law [not (not A) or not(not B) or not(not C)] and [not (not A) or (not B) or not (not C)] and [not (not A) or (not B) or (not C)] and [(not A) or not(not B) or not(not C)] } and [(not A) or (not B) or not(not C)] Example CNF step 5 : apply the double negation law [A or B or C] and [A or (not B) or C] and [A) or (not B) or (not C)] and [(not A) or B or C] } and [(not A) or (not B) or C] 24

Boolean Matching The query is not a part of the document space. It is a Boolean expression that represent some conditions on the index terms or representation of the desired documents (computer AND memory) AND (NOT chip) query Doc. representation {1 if relevant 0 if not relevant Boolean Matching How to define match for elementary query? How to define match for main boolean operations? Q1 AND Q2 Q1 OR Q2 NOT Q1 How to order? 25

Efficiency for elementary queries Fast ways to match Inverted Indexing (primary and secondary key): Fixed value search (year = 1999) Full term search (Adaptive in text) Hashing Range search (Year < 1990) Sorted array search Boolean queries and inverted file Doc1: the cat is on the mat Doc2: the mat is on the floor Inverted file cat:doc1,1 floor:doc2,5 mat:doc1,5;doc2,1 26

Boolean queries and inverted file Query : cat Inverted file cat:doc1,1 floor:doc2,5 mat:doc1,5;doc2,1 Answer : doc1 Boolean queries and inverted file Query: cat AND floor Inverted file cat:doc1,1 floor:doc2,5 mat:doc1,5;doc2,1 The algorithm looks at these two lists and try to found elements in common Answer : No document 27

Boolean queries and Termdocument matrix Rows represent document terms Columns represent documents Doc1: the cat is on the mat Doc2: the mat is on the floor cat floor mat Doc1 Doc2 1 0 0 1 1 1 Boolean queries and Termdocument matrix Query: Mat Doc1 Doc2 cat floor mat 1 0 1 0 1 1 The algorithm looks to this row Answer: Doc1 and Doc2 28

More efficiency Slower ways to match Full comparison Exact comparison (size < 5.9) Text matching Combination Do some fast search first, defining a smaller subset, then comprehensive search Matching The case of Exact Match (Boolean characteristic function) The result is either in or not Softening Boolean queries The case of A or B or C Document that satisfy A and B is better than A but worse than A and B and C 29

Homework (Part 1) Using the system at the URL (Boolean query servlet) http://kt1.sis.pitt.edu:8080/ir/uquery/matrix.html answer to the following questions (and report the query used): Did Mr. Flanagan write any book about algorithms? Answer to the question and report the query used Did. Prof. Brusilovsky write any book about Java? Answer to the question and report the query used List the books about Java or Perl language? (look at the title) Answer to the question and report the query used List the books related to Java were published in 1999? Answer to the question and report the query used List the books published by Java Soft were not authored by Flanagan? Answer to the question and report the query used Homework (Part 2) Take an example of an advanced online search system (the one you have used in HW1 or any other)retrieval systems and analyze it from the prospect of IR models presented at the lectures 3 and 4. What kind of model this system is based upon? Can you recognize one of the models we have learned? What kind of queries the system allows? How you could use boolean operators or perform Boolean search with this engine? What kinds of usual Boolean operators it has? Whay kind of proximity operators it uses? Does it provide a form-based search for users unprepared to work with complex boolean expressions? We have discussed all that (and a number of other issues) and now you should be able to apply your knowledge for thinking in a real world context. In addition to that, please, specify what other conditions you can specify in your search request to fine-tune the matching. Does the system enable you to restrict the search to a subset of documents or to consider for matching only a specific areas of documents? 30