Representation of Documents and Information Retrieval

Representation of Documents and Information Retrieval
Pavel Brazdil, LIAAD - INESC Porto LA, FEP, Univ. of Porto, http://www.liaad.up.pt
Escola de verão "Aspectos de processamento da LN", F. Letras, UP, June 9

Overview
1. Information retrieval (IR)
2. Basic representation schemes
3. Binary representation
   - Finding a similar document to a query
4. Representation tf-idf
   - Cosine similarity
5. Relation of IR systems to k-NN
6. Some problems and remedies
   - Query completion / expansion
   - Ranking / clustering of retrieved documents
7. Evaluation of IR systems

1. Information Retrieval

Information Retrieval ~ Document Retrieval: retrieval of documents in response to a query document (as a special case, the query document can consist of a few keywords).

  Query + Collection -> Matcher -> Retrieved Documents

2. Basic Representation Schemes

Documents are represented using features:
- Words. Often the term word-level tokens is used instead. Tokens can be annotated (e.g. with labels representing noun, verb, etc.). The bag-of-words representation exploits words, but their order is ignored. A word stem represents a group of related words stripped of a suffix.
- Terms may represent single words or multiword units, such as "White House".
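The tokenization and bag-of-words ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the tokenizer simply lowercases and splits on non-letters, ignoring annotation, stemming and multiword terms.

```python
import re
from collections import Counter

def tokenize(text):
    # Deliberately simple: lowercase and split on non-letter characters.
    # A real tokenizer would also handle annotation (e.g. POS labels),
    # stemming and multiword terms such as "White House".
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def bag_of_words(text):
    # Bag-of-words: keep term counts, discard word order.
    return Counter(tokenize(text))

bow = bag_of_words("The White House said the house is white.")
```

Note that "White House" ends up as two independent terms here, which is exactly the limitation the slide's remark about multiword units points at.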

Representation schemes

Here we will consider:
- binary representation, indicating the presence (or absence) of words;
- representation using tf-idf.

In each case the document is represented by a vector of numbers; this representation is therefore often referred to as a vector-space model.

3. Binary Representation

The binary representation indicates the presence / absence of words (terms): 1 indicates the presence of a term t, 0 its absence. [The slide's example table, with 7 documents D1..D7 represented over 4 terms t1..t4, did not survive transcription.]

Retrieval: Finding a similar document to a query

The binary representation can be used for document retrieval:
- a query (query document) is represented in the same manner as the other documents;
- the most similar documents are identified using some proximity measure (PM) based on the Euclidean distance between the binary vectors, normalised by the number of terms (the exact formula on the slide did not survive transcription).

[The slide's table of PM values for documents D1..D7 against the query did not survive transcription; it highlights the two documents most similar to the query.]

4. Representation tf-idf

The method of tf-idf is based on the following assumptions:
- A term appearing often in a document may be more important for identifying it than a term appearing rarely. This aspect is captured by tf, the term frequency in a document.
- If a term appears in many documents, it will probably be irrelevant. This aspect is captured by idf, the inverse document frequency.
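A minimal sketch of binary retrieval in Python. Since the slide's exact proximity formula did not survive transcription, the measure below is an assumption: the fraction of term positions on which two binary vectors agree (for binary vectors this equals 1 minus the squared Euclidean distance divided by the number of terms).

```python
def binary_vector(doc_terms, vocabulary):
    # 1 if the term occurs in the document, 0 otherwise.
    return [1 if t in doc_terms else 0 for t in vocabulary]

def proximity(v1, v2):
    # Assumed measure: fraction of term positions on which the vectors
    # agree; equals 1 - squared Euclidean distance / number of terms.
    matches = sum(1 for a, b in zip(v1, v2) if a == b)
    return matches / len(v1)

vocab = ["t1", "t2", "t3", "t4"]
doc = binary_vector({"t1", "t3"}, vocab)            # [1, 0, 1, 0]
query = binary_vector({"t1", "t2", "t3"}, vocab)    # [1, 1, 1, 0]
```

With these vectors the document agrees with the query on 3 of the 4 term positions, giving a proximity of 3/4.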

Representation tf-idf

Calculation of the tf-idf weight of a term tj in a document di:

  w_tf-idf(tj, di) = tf(tj, di) * idf(tj)

where tf(tj, di) is the frequency of term tj in document di and idf(tj) is the inverse document frequency, calculated as:

  idf(tj) = log( N / df(tj) )

where N is the number of documents and df(tj) is the number of documents in which term tj occurs.

Cosine Measure of Similarity / Distance

The cosine measure of similarity is calculated by multiplying the weights of the shared words in the two documents (similar to a logical AND of the two vectors) and normalising:

  cosine(di, dj) = Σ_tj ( w_tf-idf(tj, di) * w_tf-idf(tj, dj) ) / ( norm(di) * norm(dj) )

  norm(di) = sqrt( Σ_tj w_tf-idf(tj, di)² )
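The tf-idf and cosine formulas above can be turned into a small self-contained Python sketch; the toy documents and term names are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, with w(tj, di) = tf(tj, di) * log(N / df(tj)).
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(N / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(wi, wj):
    # Dot product over the shared terms, divided by the two vector norms.
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(wi) == 0.0 or norm(wj) == 0.0:
        return 0.0
    dot = sum(wi[t] * wj[t] for t in wi.keys() & wj.keys())
    return dot / (norm(wi) * norm(wj))

docs = [["cat", "dog"], ["cat", "fish"], ["bird", "fish"]]
vecs = tf_idf_vectors(docs)
```

Note that a term occurring in every document gets idf = log(N/N) = 0 and so contributes nothing to any similarity, which is exactly the intent of the idf assumption.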

Example of a calculation of tf-idf

The slides work through the tf-idf calculation on the running example of 7 documents represented over 4 terms: first a table of term frequencies tf(tj, di), then, for each term tj, the values df(tj), N/df(tj) and log(N/df(tj)). The numeric values of these tables did not survive transcription.

Joining the two parts to obtain tf-idf

Multiplying tf by idf gives the resulting w_tf-idf values for each document D1..D7 and for the query Q, together with the cosine of each document against the query. A further table shows the constants used in normalising the cosine similarity (for each document: the sum of squared weights, its square root, and the resulting norm). The numeric values of these tables did not survive transcription.

5. Relation of IR systems to k-NN

We note that a typical IR system, as described, is basically a k-NN method: after representing the documents as vectors, the system tries to identify the nearest match (or matches) to a query.

6. Some problems and remedies
6.1 Feature reduction
6.2 Query completion / expansion
6.3 Ranking / clustering of retrieved documents
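The k-NN observation can be made concrete: retrieval is "score every document against the query and keep the k nearest". A minimal sketch, using a simple term-overlap score as a stand-in for cosine similarity over tf-idf vectors:

```python
def knn_retrieve(query_terms, docs, k=2):
    # docs: {doc_id: set of terms}. Score every document by its term
    # overlap with the query, then keep the k best -- the k-NN step.
    scores = {d: len(query_terms & terms) for d, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = {"D1": {"cat", "dog"}, "D2": {"cat", "fish"}, "D3": {"bird"}}
top = knn_retrieve({"cat", "dog"}, docs, k=2)
```

Swapping in a different similarity function changes nothing structurally; the "rank all, keep k" step is what makes the system a k-NN method.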

6.1 Feature Reduction

Removal of high-frequency words (stop words), e.g. removal of "about", "cannot", etc. Advantages: it simplifies the search process and reduces the size of the text.

Suffix stripping and stemming (substitution of a word by a stem), e.g. substitution of "moved" by "mov". Unfortunately this can lead to errors; in general it is necessary to take the context into account.

Normalization of reduced words, e.g. substitution of "mov" by "move" (infinitive). This is usually done with the help of a lexicon.

6.2 Query completion / expansion

Problem: if the query document is very short, i.e. just a few words, the system may retrieve a large number of documents, some of them irrelevant to what the user intended.

Solution (general idea): complete / expand the query by adding more text. Specific methods that can be used:
- ask the user to expand the query;
- add synonyms to the text by exploiting e.g. WordNet;
- use other sources (e.g. Wikipedia) to expand the query;
- exploit an idea similar to active learning: ask the user to identify the documents of interest in the retrieved set and use these documents (or some of their words) to expand the query.
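The feature-reduction steps might be sketched as follows. The stop list and suffix rules here are illustrative assumptions, not those of the lecture; a real system would use a proper stemmer plus a lexicon for the normalization step.

```python
STOP_WORDS = {"the", "a", "an", "about", "cannot", "is", "of"}  # illustrative
SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative stripping rules

def reduce_features(tokens):
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    stems = []
    for t in kept:
        for suf in SUFFIXES:
            # Strip the first matching suffix, keeping a stem of >= 3 letters.
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: len(t) - len(suf)]
                break
        stems.append(t)
    return stems
```

As the slide warns, context-free stripping makes errors: "moved" and "houses" both lose their suffix, whether or not the resulting stem is a real word.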

6.3 Ranking / Clustering of Retrieved Documents

Problem: the system will in general retrieve a large number of documents, some of them irrelevant to what the user intended. Query expansion helps, but does not resolve the problem.

Solution: provide a ranking of the retrieved documents:
- assess how well each document satisfies the given query, using an appropriate distance / probabilistic measure;
- order all documents by this measure and return the top N items.

Alternatively, cluster the retrieved documents into groups, label them, and rank the documents within each group; the user may be interested in only one of these groups.

7. Evaluation of IR systems

Effectiveness is usually measured in terms of:

  Precision = (number of relevant items retrieved) / (total number of items retrieved)

  Recall = (number of relevant items retrieved) / (total number of relevant items in the collection)

Recall / precision curves

Evaluation of IR systems

  Precision Pi = TPi / (TPi + FPi)        Recall Ri = TPi / (TPi + FNi)

(the index i represents the class)

Contingency table (columns: actually relevant T / not relevant N; rows: predicted T^ / N^):

        T    N
  T^   TP   FP
  N^   FN   TN

The F measure (strictly speaking F1) combines the two measures, giving similar weight to each (the so-called harmonic mean):

  F = F1i = (2 * Pi * Ri) / (Pi + Ri) = 2 / (1/Pi + 1/Ri)

In general we could use:

  Fβi = ((β² + 1) * Pi * Ri) / (β² * Pi + Ri)        (in F1i the parameter β = 1)

Precision and recall can be combined in different ways for different classes. We distinguish:
- micro-averaged F
- macro-averaged F

For two classes i and j, macro-averaging first averages precision and recall over the classes and then combines them:

  MacroF = ( 2 * (Pi + Pj)/2 * (Ri + Rj)/2 ) / ( (Pi + Pj)/2 + (Ri + Rj)/2 )
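The evaluation formulas above, as a runnable sketch (the counts passed in are invented for illustration):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives F1.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f

def macro_f(per_class_pr):
    # Macro-averaged F: average precision and recall over the classes
    # first, then combine them with the harmonic mean.
    P = sum(p for p, _ in per_class_pr) / len(per_class_pr)
    R = sum(r for _, r in per_class_pr) / len(per_class_pr)
    return 2 * P * R / (P + R)
```

For example, with TP = 8, FP = 2, FN = 8 we get P = 0.8, R = 0.5, and F1 = 2 * 0.8 * 0.5 / 1.3 = 8/13.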

Evaluation of IR systems: TREC - Text Retrieval Conference competitions. There were several TREC competitions (e.g. one was held in 1995).