Information Retrieval and Web Search

IR models: Boolean model
Baili Zhang / Southeast

Slide 1. IR Models

[Figure: taxonomy of IR models, organized by user task]
- User task: Retrieval (ad hoc, filtering)
  - Classic models: Boolean, vector, probabilistic
    - Set theoretic: fuzzy, extended Boolean
    - Algebraic: generalized vector, latent semantic indexing, neural networks
    - Probabilistic: inference network, belief network
  - Structured models: non-overlapping lists, proximal nodes
- User task: Browsing
  - Browsing: flat, structure guided, hypertext

Slide 2. The Boolean Model

- Simple model based on set theory
- Queries are specified as Boolean expressions, with precise semantics and a neat formalism, e.g. $q = k_a \wedge (k_b \vee \neg k_c)$
- Terms are either present or absent, thus $w_{ij} \in \{0,1\}$
- Each query can be transformed into disjunctive normal form (DNF). For $q = k_a \wedge (k_b \vee \neg k_c)$:
  $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$

Slide 3. The Boolean Model

[Figure: the sets $K_a$, $K_b$, $K_c$ for $q = k_a \wedge (k_b \vee \neg k_c)$, with the regions (1,0,0), (1,1,0) and (1,1,1) marked]

$sim(q,d_j) = 1$ if the document satisfies the Boolean query, and $0$ otherwise. There is no in-between, only 0 or 1.
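The DNF matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function names (`query`, `to_dnf`, `sim`) and the three sample documents are hypothetical, and documents are assumed to be given as binary presence vectors over (ka, kb, kc).

```python
from itertools import product

# Index terms in a fixed order; a document is a binary vector over (ka, kb, kc).
TERMS = ("ka", "kb", "kc")


def query(ka: bool, kb: bool, kc: bool) -> bool:
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)


def to_dnf(q) -> set:
    """Enumerate every binary term pattern that satisfies q.

    Each pattern is one conjunctive component of the query's DNF.
    """
    return {
        bits
        for bits in product((1, 0), repeat=len(TERMS))
        if q(*(bool(b) for b in bits))
    }


def sim(dnf: set, doc: tuple) -> int:
    """Boolean similarity: 1 if the document's pattern is in the DNF, else 0."""
    return 1 if doc in dnf else 0


Q_DNF = to_dnf(query)

# Hypothetical documents, written as (ka, kb, kc) presence vectors.
docs = {"d1": (1, 1, 0), "d2": (0, 1, 1), "d3": (1, 0, 1)}
results = {name: sim(Q_DNF, vec) for name, vec in docs.items()}

print(sorted(Q_DNF, reverse=True))
print(results)
```

Only d1 is retrieved here: d2 lacks ka, and d3 contains kc without kb, so neither pattern appears in the DNF.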

Slide 4. Exercise

- D1 = computer information retrieval
- D2 = computer retrieval
- D3 = information
- D4 = computer information
- Q1 = information retrieval
- Q2 = information computer

Slide 5. Exercise

[Figure: items numbered 0 to 15]

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
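One way to work through the four-term query above is to enumerate all 2^4 = 16 presence/absence combinations of the terms and keep those that satisfy it. The sketch below does that; the names `AUTHORS`, `matches` and `satisfying` are illustrative and not part of the exercise.

```python
from itertools import product

AUTHORS = ("chaucer", "milton", "swift", "shakespeare")


def matches(chaucer, milton, swift, shakespeare) -> bool:
    """((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))."""
    return bool(
        ((chaucer or milton) and not swift)
        or (not chaucer and (swift or shakespeare))
    )


# Every 0/1 assignment to the four terms that satisfies the query.
satisfying = [bits for bits in product((0, 1), repeat=4) if matches(*bits)]

for bits in satisfying:
    present = [name for name, bit in zip(AUTHORS, bits) if bit]
    print(bits, present)

print(len(satisfying), "of 16 combinations satisfy the query")
```

Each printed tuple is one conjunctive component of the query's disjunctive normal form.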

Slide 6. Drawbacks of the Boolean Model

- Retrieval is based on binary decision criteria, with no notion of partial matching
- No ranking of the documents is provided (absence of a grading scale)
- The information need has to be translated into a Boolean expression, which most users find awkward
- The Boolean queries formulated by users are most often too simplistic
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Slide 7

- The Boolean model imposes a binary criterion for deciding relevance
- The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
- Two extensions of the Boolean model:
  - Fuzzy Set Model
  - Extended Boolean Model

Slide 8. Extended Boolean Model

- The Boolean model is simple and elegant, but it makes no provision for ranking
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
- Extend the Boolean model with the notions of partial matching and term weighting
- Combine characteristics of the vector model with properties of Boolean algebra

Slide 9. The Idea

- The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
- Let $q = k_x \wedge k_y$
- Use weights associated with $k_x$ and $k_y$
- In the Boolean model the weights are either 1 or 0: only documents with $w_x = w_y = 1$ match, and all other documents are irrelevant

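Slide 10. The Idea

$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$

[Figure: AND. Axes $k_x$ (with $x = w_{xj}$) and $k_y$ (with $y = w_{yj}$); corner points (0,0) and (1,1); documents $d_j$ and $d_{j+1}$]

We want a document to be as close as possible to (1,1).

Slide 11. The Idea

$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$

[Figure: OR. Axes $k_x$ (with $x = w_{xj}$) and $k_y$ (with $y = w_{yj}$); corner points (0,0) and (1,1); documents $d_j$ and $d_{j+1}$]

We want a document to be as far as possible from (0,0).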
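The surviving slide text gives only the geometric goal. In the usual two-term formulation of the extended Boolean model, the Euclidean distance is divided by $\sqrt{2}$ (the diagonal of the unit square) so that scores fall in [0,1]. The block below states those two formulas and works one example; the document weights x = 0.8, y = 0.4 are hypothetical, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
sim(q_{or}, d_j)  &= \sqrt{\frac{x^2 + y^2}{2}}
  && \text{normalized distance from } (0,0) \\
sim(q_{and}, d_j) &= 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}
  && \text{one minus normalized distance from } (1,1)
\end{align*}

% Hypothetical document with x = 0.8, y = 0.4:
\begin{align*}
sim(q_{or}, d_j)  &= \sqrt{\frac{0.64 + 0.16}{2}} = \sqrt{0.4} \approx 0.632 \\
sim(q_{and}, d_j) &= 1 - \sqrt{\frac{0.04 + 0.36}{2}} = 1 - \sqrt{0.2} \approx 0.553
\end{align*}
```

The same document scores higher under OR than under AND, which matches the intuition that a partial match satisfies a disjunction more easily than a conjunction.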

Slide 12. Generalizing the Idea

- We can extend the previous model to consider Euclidean distances in a t-dimensional space
- This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter
- A generalized disjunctive query is given by $q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_m$
- A generalized conjunctive query is given by $q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_m$

Slide 13. Generalizing the Idea

If $p = 1$ then (similar to the vector model)

$sim(q_{or},d_j) = sim(q_{and},d_j) = \frac{x_1 + \dots + x_m}{m}$
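The p = 1 case can be checked numerically against the general p-norm similarities. The sketch below assumes the standard p-norm forms, sim(q_or) = ((x1^p + ... + xm^p)/m)^(1/p) and sim(q_and) = 1 - (((1-x1)^p + ... + (1-xm)^p)/m)^(1/p); the function names and the sample weights are hypothetical.

```python
import math


def sim_or(weights, p):
    """p-norm similarity for a generalized disjunctive query."""
    if math.isinf(p):
        # Limit p -> infinity: the largest weight dominates.
        return max(weights)
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1.0 / p)


def sim_and(weights, p):
    """p-norm similarity for a generalized conjunctive query."""
    if math.isinf(p):
        # Limit p -> infinity: the smallest weight dominates.
        return min(weights)
    m = len(weights)
    return 1.0 - (sum((1.0 - x) ** p for x in weights) / m) ** (1.0 / p)


# Hypothetical term weights x1..xm of one document for the query terms.
weights = [0.8, 0.4, 0.1]
mean = sum(weights) / len(weights)

for p in (1, 2, 5, math.inf):
    print(p, round(sim_or(weights, p), 4), round(sim_and(weights, p), 4))
```

With p = 1 both scores collapse to the plain average of the weights, so AND and OR are indistinguishable. As p grows, the OR score moves toward the maximum weight and the AND score toward the minimum.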

Slide 14. Extended Boolean Model

- Model is quite powerful
- Properties are interesting and might be useful
- Computation is somewhat complex
- However, distributivity does not hold for the ranking computation:
  $q_1 = (k_1 \vee k_2) \wedge k_3$
  $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
  $sim(q_1,d_j) \ne sim(q_2,d_j)$

Slide 15. Fuzzy Set Model

- Queries and docs are represented by sets of index terms: matching is approximate from the start
- This vagueness can be modeled using a fuzzy framework, as follows:
  - with each term is associated a fuzzy set
  - each doc has a degree of membership in this fuzzy set
- This interpretation provides the foundation for many models for IR based on fuzzy theory
- Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
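Returning to the distributivity remark for the extended Boolean model: the inequality can be seen with a small numeric check. The sketch assumes p = 2, two-operand p-norm operators, and recursive evaluation in which the score of a sub-expression is fed into the enclosing operator as if it were a term weight. The weights x1, x2, x3 are hypothetical.

```python
P = 2  # assumed p-norm parameter


def p_or(a, b, p=P):
    """Two-operand p-norm OR: normalized distance from (0,0)."""
    return ((a ** p + b ** p) / 2) ** (1 / p)


def p_and(a, b, p=P):
    """Two-operand p-norm AND: one minus normalized distance from (1,1)."""
    return 1 - (((1 - a) ** p + (1 - b) ** p) / 2) ** (1 / p)


# Hypothetical weights of k1, k2, k3 in one document dj.
x1, x2, x3 = 0.9, 0.1, 0.5

# q1 = (k1 OR k2) AND k3
sim_q1 = p_and(p_or(x1, x2), x3)

# q2 = (k1 AND k3) OR (k2 AND k3)
sim_q2 = p_or(p_and(x1, x3), p_and(x2, x3))

print(round(sim_q1, 4), round(sim_q2, 4))
```

The two queries are equivalent in Boolean algebra, yet they rank this document differently (about 0.564 versus 0.491), so rewriting a query can change the ranking.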

Slide 16. Fuzzy Set Theory

- Framework for representing classes whose boundaries are not well defined
- Key idea is to introduce the notion of a degree of membership associated with the elements of a set
- This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
- Thus, membership is now a gradual notion, contrary to the notion enforced by classic Boolean logic

Slide 17. Fuzzy Set Theory

Definition: A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu_A : U \to [0,1]$ which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval [0,1].

Definition: Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\bar{A}$ be the complement of $A$. Then,

$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
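The three definitions translate directly into code. In this sketch a fuzzy subset is assumed to be a dictionary mapping each element of U to its membership degree; the element names and degrees are made up for illustration.

```python
def complement(a):
    """Membership in the complement: 1 - mu_A(u)."""
    return {u: 1.0 - mu for u, mu in a.items()}


def union(a, b):
    """Membership in the union: max(mu_A(u), mu_B(u))."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}


def intersection(a, b):
    """Membership in the intersection: min(mu_A(u), mu_B(u))."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}


# Hypothetical fuzzy subsets A and B of U = {u1, u2, u3}.
A = {"u1": 0.9, "u2": 0.4, "u3": 0.0}
B = {"u1": 0.2, "u2": 0.7, "u3": 0.5}

print("A or B :", union(A, B))
print("A and B:", intersection(A, B))
print("not A  :", complement(A))

# Unlike crisp sets, A united with its complement need not cover U fully.
print("A or not A at u2:", union(A, complement(A))["u2"])
```

The last line prints 0.6 rather than 1: with gradual membership, the law of the excluded middle no longer holds.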

Slide 18. Fuzzy Information Retrieval

- Fuzzy sets are modeled based on a thesaurus
- This thesaurus is built as follows:
  - Let $\vec{c}$ be a term-term correlation matrix
  - Let $c_{i,l}$ be a normalized correlation factor for $(k_i, k_l)$:
    $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
    - $n_i$: number of docs which contain $k_i$
    - $n_l$: number of docs which contain $k_l$
    - $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$
- We now have the notion of proximity among index terms.

Slide 19. Fuzzy Information Retrieval

- The correlation factor $c_{i,l}$ can be used to define fuzzy set membership for a document $d_j$ as follows:
  $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
  - $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
- The above expression computes an algebraic sum over all terms in the doc $d_j$
- A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$
- If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have $c_{i,l} \sim 1$ and therefore $\mu_{i,j} \sim 1$
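As a small worked sketch of the two formulas, the code below builds the correlation factors from the four documents of the earlier exercise (D1 to D4) and computes the membership of D2 in the fuzzy set of the term "information", a term D2 does not contain. Treating each document as a set of terms and the function names are assumptions made for illustration.

```python
# The four documents of the exercise, each reduced to its set of index terms.
docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}


def correlation(ki, kl, docs):
    """c(i,l) = n(i,l) / (n_i + n_l - n(i,l))."""
    n_i = sum(1 for terms in docs.values() if ki in terms)
    n_l = sum(1 for terms in docs.values() if kl in terms)
    n_il = sum(1 for terms in docs.values() if ki in terms and kl in terms)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0


def membership(ki, doc_terms, docs):
    """mu(i,j) = 1 - product over terms kl in dj of (1 - c(i,l))."""
    prod = 1.0
    for kl in doc_terms:
        prod *= 1.0 - correlation(ki, kl, docs)
    return 1.0 - prod


c_comp = correlation("information", "computer", docs)   # 2 / (3 + 3 - 2)
c_retr = correlation("information", "retrieval", docs)  # 1 / (3 + 2 - 1)
mu_d2 = membership("information", docs["D2"], docs)
mu_d3 = membership("information", docs["D3"], docs)

print(c_comp, c_retr, mu_d2, mu_d3)
```

D2 gets membership 1 - (1 - 0.5)(1 - 0.25) = 0.625 for "information" even though the term is absent, because its own terms co-occur with it elsewhere. D3 contains the term itself, so c(i,i) = 1 and its membership is 1.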

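Slide 20. Fuzzy Information Retrieval

- Disjunctive set: algebraic sum
  $\mu_{cc_1 \vee cc_2 \vee cc_3,\, j} = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i, j})$
- Conjunctive set: algebraic product
  $\mu_{cc_1 \wedge cc_2 \wedge cc_3,\, j} = \prod_{i=1}^{3} \mu_{cc_i, j}$

Slide 21. Fuzzy IR: An Example

[Figure: the sets $K_a$, $K_b$, $K_c$ with the regions $cc_1$, $cc_2$, $cc_3$ marked]

$q = k_a \wedge (k_b \vee \neg k_c)$

$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$

$\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,\, j} = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i, j})$
$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \cdot (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \cdot (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$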
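The final expression can be evaluated directly once the three term memberships of a document are known. A minimal sketch, with a hypothetical function name and made-up membership values:

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in q = ka AND (kb OR NOT kc).

    Each conjunctive component uses the algebraic product (with 1 - mu for a
    negated term); the components are combined with the algebraic sum.
    """
    cc1 = mu_a * mu_b * mu_c                # pattern (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)          # pattern (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # pattern (1,0,0)

    prod = 1.0
    for cc in (cc1, cc2, cc3):
        prod *= 1 - cc
    return 1 - prod


# Hypothetical memberships of one document in the fuzzy sets of ka, kb, kc.
score = fuzzy_query_membership(0.8, 0.6, 0.2)
print(round(score, 4))

# With crisp (0/1) memberships the result reduces to the Boolean model.
print(fuzzy_query_membership(1, 1, 0), fuzzy_query_membership(0, 1, 1))
```

Unlike the Boolean model, the score is graded, so documents can be ranked by it. With memberships restricted to 0 and 1 the function returns exactly the Boolean answer.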